Extreme Value Statistics and Traveling Fronts: An Application to Computer Science 
We study the statistics of height and balanced height in the binary search tree problem in computer 
science. The search tree problem is first mapped to a fragmentation problem which is then further 
mapped to a modified directed polymer problem on a Cayley tree. We employ the techniques 
of traveling fronts to solve the polymer problem and translate back to derive exact asymptotic 
properties in the original search tree problem. The second mapping allows us not only to re-derive 
the already known results for random binary trees but to obtain new exact results for search trees 
where the entries arrive according to an arbitrary distribution, not necessarily randomly. Besides, it 
allows us to derive the asymptotic shape of the full probability distribution of height and not just 
its moments. Our results are then generalized to m-ary search trees with arbitrary distribution. 



PACS numbers: 89.20.Ff, 02.50.-r, 89.75.Hc 
I. INTRODUCTION 

The techniques developed in statistical physics, particularly
in the theory of spin glasses, have recently been
applied to a variety of problems in theoretical computer
science [1]. These include various optimization problems
such as the traveling salesman problem [2], graph partitioning [3],
satisfiability problems [4], the knapsack problem [5],
the vertex covering problem [6], error correcting codes [7],
number partitioning problems [8], matching problems [9]
and many others [10]. The purpose of
this paper is to study analytically certain problems in a
different area of theoretical computer science known as
sorting and searching [11]. The standard techniques of
spin glass theory are not directly suitable for these problems.
Instead, we employ the techniques developed to
study the propagation of traveling fronts in various
nonlinear systems [12]-[18].

Sorting and searching is an important area of
computer science that deals with the following basic question:
how should the incoming data be organized so
that the computer takes the minimum time to search for
a given data item if required later? Among the various search
algorithms, binary search turns out to be one of the
most efficient. To understand this algorithm, let us
start with a simple example. Suppose the incoming data 
string consists of the twelve months of the year appearing 
in the following order: July, September, December, May, 
April, February, January, October, November, March, 
June and August. Suppose later we need to look for 
the month of August in this data string. Consider first 
the sequential search where the computer starts from the 
first element (July), checks if it is the right month and if
not, moves to the next element of the string (September),
checks the element there and continues in this fashion until
it finds the right month. In the example above, to find
the month 'August', the computer has to make 12 comparisons.
Thus, the sequential search algorithm is rather
inefficient, as it typically takes a search time of order N,
where N is the number of entries in the data string.

In a binary search, on the other hand, the typical 
search time scales as log N. The binary search is
implemented by organizing the data string on a tree ac- 
cording to the following algorithm. An order is first cho- 
sen for the incoming data, e.g., it can be alphabetical 
or chronological (January, February, March, etc.). Let 
us choose the chronological order. Now the first element 
of the input string (July) is put at the root of a tree 
(see Fig. 1). The next element of the string is Septem- 
ber. One compares with the root element (July) and sees 
that September is bigger than July (in chronological or- 
der). So one assigns September to a daughter node of the 
root in the right branch. On the other hand, if the new 
element were less than the root, it would have gone to
the daughter node of the left branch. Then the next el- 
ement is December. We compare at the root (July) and 
decide that it has to go to the right, then we compare 
with the existing right daughter node (September) and 
decide that December has to go to the node which is the 
right daughter of September. The process continues till 
all the elements are assigned their nodes on the tree. For 
the particular data string in the above example, we fi- 
nally get the unique tree as shown in Fig. 1. Such a tree 
is called a binary search tree (BST). 

Once this tree is constructed, a subsequent search,
say for the month of August, takes far fewer
comparisons. We start with the root (July). Since the
sought-after element August is bigger than July, we know
that it must be in the right one of the two daughter
sub-trees. This already eliminates searching the roughly half
of the elements that are on the left sub-tree. We next
encounter the key September. Since August is less than
September, we go to the left and thus we no longer need to
search the right branch of the sub-tree rooted at
September. Once we go to the left, we find the required
key August.







FIG. 1. The binary search tree corresponding to the data 
string in the order: July, September, December, May, April, 
February, January, October, November, March, June and Au- 
gust. 

Thus, the BST algorithm requires only 3 comparisons
as opposed to 12 operations in the sequential search.
Since the typical search time is proportional to the depth
of an element in the tree and since the typical depth $D$
is related to the total size $N$ via $2^D \approx N$, the search
time scales as $\log N$, making the BST algorithm one of
the most efficient search algorithms.
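The construction and the two searches described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Node`, `insert`, `comparisons_bst` and `comparisons_sequential` are hypothetical, and the months are labeled 1 to 12 in chronological order (July = 7, August = 8).

```python
class Node:
    """One node of a binary search tree."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key below root following the rule in the text; return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:            # smaller keys go to the left branch
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:                         # larger keys go to the right branch
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def comparisons_bst(root, key):
    """Number of keys examined until key is found (0 if it is absent)."""
    count, node = 0, root
    while node is not None:
        count += 1
        if key == node.key:
            return count
        node = node.left if key < node.key else node.right
    return 0


def comparisons_sequential(sequence, key):
    """Number of elements examined by a left-to-right scan."""
    return sequence.index(key) + 1


# Data string of Fig. 1: July, September, December, May, ...
fig1_order = [7, 9, 12, 5, 4, 2, 1, 10, 11, 3, 6, 8]
root = None
for k in fig1_order:
    root = insert(root, k)

AUGUST = 8
print(comparisons_bst(root, AUGUST))               # 3
print(comparisons_sequential(fig1_order, AUGUST))  # 12
```

The search visits July, September and August in turn, which reproduces the count of 3 comparisons quoted above.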




FIG. 2. The binary search tree corresponding to the data 
string in the order: May, November, August, April, Decem- 
ber, February, June, September, July, January, October and 
March. 

If the incoming data string had a different order of appearance,
one would in general have obtained a different BST. For
example, suppose the months appear in a different order
as: May, November, August, April, December, February,
June, September, July, January, October and March.
For this data string, the same algorithm of constructing
a binary tree as before gives a tree of different shape
(Fig. 2). Each permutation of the incoming data string
generates a binary tree, and there are $N!$ possible
permutations of an incoming data string with $N$
entries (distinct permutations may generate the same tree).
Usually the incoming data string appears in a
random fashion, so that each of the $N!$
permutations occurs with equal probability. Trees
generated in this way are called
'random binary search trees' (RBST). Of course, if the
incoming data is not completely random, the probability
measure over the space of trees will be different. The
results derived in this paper will be applicable not only
to RBST but also to more general BSTs with arbitrary
measure.
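A small enumeration makes the measure over trees concrete. The sketch below is illustrative only; the function names `bst_shape` and `tree_weights` are hypothetical. It builds the BST for every insertion order of the keys 1, ..., N and counts how many orders lead to each tree. It uses the fact that the first element becomes the root, and that the smaller and larger remaining keys, kept in their order of arrival, build the left and right sub-trees.

```python
from collections import Counter
from itertools import permutations


def bst_shape(sequence):
    """Nested tuple (key, left_subtree, right_subtree) of the BST built by
    inserting the keys of sequence one at a time; None for an empty tree."""
    if not sequence:
        return None
    root = sequence[0]
    left = [k for k in sequence[1:] if k < root]
    right = [k for k in sequence[1:] if k > root]
    return (root, bst_shape(left), bst_shape(right))


def tree_weights(n):
    """Map each distinct tree to the number of insertion orders producing it."""
    return Counter(bst_shape(list(p)) for p in permutations(range(1, n + 1)))


weights = tree_weights(3)
print(len(weights))              # 5 distinct trees from 3! = 6 orders
print(sorted(weights.values()))  # [1, 1, 1, 1, 2]
```

For N = 3 the orders (2, 1, 3) and (2, 3, 1) give the same fully balanced tree, so that tree carries weight 2/6 while the other four carry 1/6 each.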

Each BST has several observables (such as the depth
or the height of a tree) associated with it that quantify
the efficiency of the underlying search algorithm. Hence
the knowledge of the statistics of such observables is of
central importance. Here are a few observables:

• $D_N$ is the depth (distance from the root) of the last
inserted element in a given BST of size $N$. For example,
$D_N = 3$ for the tree in Fig. 1 (counting the depth of the
root element as 1). The value of $D_N$ differs from tree to tree, so $D_N$
is a random variable. The average depth $\langle D_N \rangle$ (averaged
over the probability measure of the trees) gives a measure
of the average time required to insert a new element
in a tree [11].

• $H_N$ is the height of a given tree, defined as the depth
of the farthest node from the root. For example, $H_N = 5$
for the BST in Fig. 1. Clearly $H_N$ is also a random variable
and $\langle H_N \rangle$ gives a measure of the maximum possible
time that could be required to insert an element, i.e. a
measure of the worst case scenario.

• $h_N$ is the balanced height defined as the maximum
depth from the root up to which the tree is fully balanced,
i.e., all the nodes up to this depth are fully occupied.
In the tree shown in Fig. 1, $h_N = 3$, whereas $h_N = 2$ in
the tree in Fig. 2. Hence $h_N$ is also a random variable
whose statistics are important.
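The three observables can be computed directly from an insertion sequence. The following sketch is an illustration under stated assumptions: the helper names `bst_depths` and `observables` are hypothetical, the root has depth 1 as in the text, and the balanced height is taken to be the largest level $d$ such that every level $1, \ldots, d$ holds its maximal number $2^{d-1}$ of nodes.

```python
from collections import Counter


def bst_depths(sequence):
    """Depth of every key (root = 1) in the BST built from sequence."""
    root = sequence[0]
    children = {root: [None, None]}   # key -> [left child, right child]
    depth = {root: 1}
    for key in sequence[1:]:
        node = root
        while True:
            side = 0 if key < node else 1
            nxt = children[node][side]
            if nxt is None:           # free slot found: attach the new key
                children[node][side] = key
                children[key] = [None, None]
                depth[key] = depth[node] + 1
                break
            node = nxt
    return depth


def observables(sequence):
    """Return (D_N, H_N, h_N) for the BST built from sequence."""
    depth = bst_depths(sequence)
    d_last = depth[sequence[-1]]      # depth of the last inserted element
    height = max(depth.values())      # depth of the farthest node
    per_level = Counter(depth.values())
    balanced = 0
    while per_level.get(balanced + 1, 0) == 2 ** balanced:
        balanced += 1                 # level balanced+1 is completely filled
    return d_last, height, balanced


fig1 = [7, 9, 12, 5, 4, 2, 1, 10, 11, 3, 6, 8]
fig2 = [5, 11, 8, 4, 12, 2, 6, 9, 7, 1, 10, 3]
print(observables(fig1))  # (3, 5, 3)
print(observables(fig2))  # (4, 5, 2)
```

The values for Fig. 1 agree with those quoted in the list above, and the balanced height 2 for Fig. 2 reflects the missing right daughter of April at the third level.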

Some of the observables mentioned above, such as the
height $H_N$ and the balanced height $h_N$, are of extremal
nature, i.e., they are the maximum or the minimum of a
set of correlated random variables. In this paper we limit
ourselves only to such extreme observables of the binary
tree. While the statistics of the extremum of a set of
uncorrelated random variables is well understood,
little is known about the same for correlated variables.
However, in the present problem the random variables
are correlated in a special hierarchical way which
facilitates analysis. We will see that the extreme variables
in the BST problem satisfy nonlinear recursion relations
that admit traveling front solutions in some suitable
variables. A lot is known about the speed and the
shape of such fronts appearing in various nonlinear systems
[12]-[18]. Below, we will use these techniques to
study the statistics of extreme variables in the binary
tree problem.

Some of our results for the RBST were already known,
and we will mention these as we go along. However, the
approach used here is quite different from those used by
computer scientists. Computer scientists tend to establish
upper and lower bounds on the quantity of interest
(typically the average value or the variance of the observable)
and then tighten the bounds [21]. If the bounds
coincide, one obtains an exact result [21]. Our approach,
on the other hand, is a typical physicist's approach. The
methods we use may not always be rigorous in the strict
mathematical sense, but they lead to exact asymptotic
results in a physically transparent way. Moreover, our
approach allows us not only to reproduce already known
asymptotics for the average height and the average balanced
height of the RBST, but also to obtain new
information about the variance and even the asymptotic
shapes of the full probability distributions. Besides, our
method goes beyond the RBST and yields exact results
for trees generated with arbitrary distributions.

Our approach utilizes two exact mappings which can
be summarized as follows. Following Devroye [27], we
first map the RBST problem to a random fragmentation
problem where an object of initial length $N$ breaks
randomly into two fragments, each of which further breaks
randomly into two parts, and so on. The fragmentation
problem is interesting in its own right as it appears in the
context of various physical problems such as the energy
cascades in turbulence [28], rupture processes in
earthquakes [29], financial crashes in stock markets [30], and
the stress propagation in granular media [31]. Some
of the extremal problems in the random fragmentation
problem were studied recently by Hattori and Ochiai [32]
and by us [33]. The method used in our previous short
paper [33] allowed us to obtain exact asymptotic results
for the average of the maximal piece of the $2^n$ fragments
after $n$ iterations. The statistics of this maximal piece
is closely related to the statistics of the height in the
RBST. However, this method was not easy to extend to
the cases beyond the random fragmentation, i.e., when
the break point is chosen from an arbitrary distribution,
not necessarily uniform. We will see later that a
fragmentation problem with a given break point distribution
corresponds to a BST problem where the incoming entries
to the tree appear according to a specific distribution and
not just randomly.

In this paper, we show that the fragmentation problem,
with arbitrary break point distribution, can further be
mapped onto a modified directed polymer (MDP) problem
on a Cayley tree. The MDP problem differs from the
conventional directed polymer (DP) problem on a Cayley
tree studied by Derrida and Spohn due to the
presence of a special constraint in the MDP. Derrida and
Spohn were mostly interested in the finite temperature
spin glass transition in the DP problem. Our problem
reduces to a zero temperature problem, albeit with a
special constraint. We then solve this MDP problem using
traveling front techniques and translate back to derive
exact asymptotic results for the original BST problem. We
will see that the statistics of the height $H_N$ of the BST
problem is related (via the two successive mappings) to
the statistics of the minimum or the ground state energy
of the MDP problem. On the other hand, the statistics
of the balanced height $h_N$ will be related to that of the
maximum energy of the directed polymer (a quantity of
little interest in the statistical physics framework). This
second mapping also allows us to obtain new exact results
for the nonrandom BST problem.

The paper is organized as follows. In Sect. II, we set up
notation, review known results for the RBST problem,
and summarize novel results obtained in this paper.
Section III contains the exact mapping of the BST problem
to the fragmentation problem. In Sect. IV, we map the
fragmentation problem to the MDP problem,
derive the exact nonlinear recursion relations in the
MDP problem and analyze them using the traveling front
techniques. The main results for the RBST problem are
also derived in this section. In Sect. V, we go beyond
the random trees and derive exact results for the
fragmentation problem with arbitrary break point distribution.
Section VI contains the generalization to the case
of m-ary trees with arbitrary distributions. We finally
conclude in Sect. VII with a summary and outlook.

II. BINARY SEARCH TREES: OLD AND NEW 
RESULTS 

Let us label the incoming data string of N elements 
by integers 1, . . ., N. For example, if the data string 
consists of the 12 months of the year, we can label, say 
the month of January by 1, the month of February by 
2, and so on. In that example, N = 12. A specific data 
string will then be isomorphic to a corresponding ordered 
sequence of these integers. For example, the particu- 
lar sequence of months in Fig. 1 reduces to the ordered 
sequence [7,9,12,5,4,2,1,10,11,3,6,8]. A different se- 
quence of months will correspond to a different permuta- 
tion of these integers. Each such sequence or permutation 
will then correspond to a separate BST, constructed by 
the algorithm explained in the introduction. In RBST,
all these $N!$ sequences (and their corresponding trees)
occur with equal probability.

We will focus on the statistics of the extreme variables
associated with these trees, in particular the height $H_N$
and the balanced height $h_N$ of a BST as defined in the
introduction. Each BST has a unique value of $H_N$ and
$h_N$. Since the trees occur with a given probability distribution
(which is uniform over the input sequences in the case of RBST),
both $H_N$ and $h_N$ are random variables. Of interest are the statistics of
these variables such as the average, variance or even the
full probability distributions of $H_N$ and $h_N$.

The RBST problem has been studied for a long time
by computer scientists and we now mention a few known
results. Devroye proved that for large $N$, the average
height of a RBST $\langle H_N \rangle \approx \alpha_0 \log N$ where the constant
$\alpha_0 = 4.31107\ldots$. Hattori and Ochiai conjectured that
the true asymptotic behavior of $\langle H_N \rangle$ has an additional
sub-leading double logarithmic correction,

$$ \langle H_N \rangle \approx \alpha_0 \log N + \alpha_1 \log(\log N), \tag{1} $$

and they determined the constant $\alpha_1 \approx -1.75$ numerically [32].
Using traveling front techniques we confirmed
the above asymptotics and computed the correction term
analytically, $\alpha_1 = -3\alpha_0/[2(\alpha_0 - 1)] = -1.95303\ldots$
The same result was simultaneously proved by Reed.
Based on numerical data, Robson conjectured that
the variance is bounded. Recently, Drmota [37] has
proved that all moments $\langle (H_N - \langle H_N \rangle)^m \rangle$ are bounded.

For the balanced height $h_N$ of RBST, Devroye showed
that the leading asymptotic behavior of the average
balanced height is given by $\langle h_N \rangle \approx \alpha_0' \log N$ where
$\alpha_0' = 0.3733\ldots$ [21,27]. Indeed, $\alpha_0$ in $\langle H_N \rangle$ and $\alpha_0'$ in
$\langle h_N \rangle$ turn out to be the two solutions of the same
transcendental equation $(2e/\alpha)^{\alpha} = e$ [21,27]. This suggests
some kind of duality between the height and the balanced
height. We will show later that the correct asymptotic
behavior of $\langle h_N \rangle$ is given by

$$ \langle h_N \rangle \approx \alpha_0' \log N + \alpha_1' \log(\log N), \tag{2} $$

where the relation $\alpha_1' = -3\alpha_0'/[2(\alpha_0' - 1)]$ holds again.
Drmota has recently proved that all the moments of $h_N$ are
also bounded [37], as in the case of $H_N$.
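The numerical constants quoted above can be checked independently. The sketch below, with hypothetical helper names `f`, `bisect` and `correction`, solves the transcendental equation $(2e/\alpha)^{\alpha} = e$ in its logarithmic form $\alpha \log(2e/\alpha) = 1$ by plain bisection. The two brackets are our own choice and rely on the left hand side being maximal at $\alpha = 2$.

```python
import math


def f(alpha):
    """alpha*log(2e/alpha) - 1; its zeros solve (2e/alpha)**alpha = e."""
    return alpha * math.log(2.0 * math.e / alpha) - 1.0


def bisect(func, lo, hi, tol=1e-12):
    """Plain bisection; assumes func(lo) and func(hi) have opposite signs."""
    flo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        fmid = func(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid       # the root lies in the upper half
        else:
            hi = mid                  # the root lies in the lower half
    return 0.5 * (lo + hi)


def correction(alpha):
    """Sub-leading coefficient -3*alpha / [2*(alpha - 1)]."""
    return -3.0 * alpha / (2.0 * (alpha - 1.0))


alpha0 = bisect(f, 2.0, 10.0)         # larger root: height
alpha0_prime = bisect(f, 0.01, 2.0)   # smaller root: balanced height

print(alpha0, correction(alpha0))
print(alpha0_prime, correction(alpha0_prime))
```

The larger root reproduces $\alpha_0 = 4.31107\ldots$ with $\alpha_1 = -1.95303\ldots$, and the smaller root reproduces $\alpha_0' = 0.3733\ldots$.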

Note that all the results mentioned above are for RBST
with fixed size $N$. Recently, by using a rate equation
approach, we studied the statistics of height and balanced
height for randomly growing binary trees where the
average size of a tree grows linearly with time, $\langle N(t) \rangle \sim t$.
The expected height and balanced height for large
random binary trees were found to have exactly the same
asymptotic formulas (1)-(2), provided one replaces $N$ by
$\langle N(t) \rangle$ in these equations. This approach is thus
reminiscent of the grand canonical approach in statistical
mechanics with the time $t$ playing the role of the chemical
potential which can be chosen to fix the average size. In
this paper, we focus only on the canonical approach, i.e.,
trees with fixed given size $N$, since this is more familiar
in theoretical computer science.

We will exploit a two stage mapping "BST problem ->
fragmentation problem -> MDP problem" and use the
traveling front technique to analyze the MDP problem.
This technique allows us to re-derive in a physically
transparent way all the results for the RBST mentioned above
and provides many new results. We will show that the
constants $\alpha_0$ and $\alpha_0'$ are simply related to the velocities
of traveling fronts. The sub-leading correction terms can
also be derived analytically. The traveling front approach
also predicts 'concentration of measure' of the variables
$H_N$ and $h_N$. This means that the asymptotic probability
distributions of these variables are highly localized
around their respective averages. As a result, a typical
value of $H_N \sim \langle H_N \rangle$ and the spread in $H_N$ is of
order $O(1)$ in the large $N$ limit. Naturally the variance
and higher cumulants of both $H_N$ and $h_N$ are bounded.
We also derive an asymptotically exact nonlinear integral
equation for the full probability distributions of $H_N$ and
$h_N$. While we could not solve this nonlinear equation in
closed form, we could derive the behaviors at the tails
of these highly localized distribution functions. We will
also see that within this approach the variables $H_N$ and
$h_N$ map respectively onto the minimum and maximum
energy of a directed polymer and hence the observed
duality between them is rather natural.

The main advantage of the present approach is that
it allows us to go beyond the random trees and obtain
exact asymptotic results for the statistics of $H_N$ and $h_N$
for BSTs with arbitrary distributions. This is the main
new result of the present paper. Besides, we also
generalize basic results to m-ary search trees with arbitrary
distributions.



III. MAPPING OF THE BST PROBLEM TO A 
FRAGMENTATION PROBLEM 

In order to derive the asymptotics of the statistics of
the height and the balanced height in the BST problem,
it is convenient to first map this problem to a
fragmentation problem following Devroye [21,27]. To illustrate
how this mapping works, let us consider again the
example in Fig. 1 where the months (or the corresponding
integers from 1 to 12) appear in the particular sequence
[7,9,12,5,4,2,1,10,11,3,6,8]. The first element (which
in this example is 7) is chosen randomly from the
available $N = 12$ elements in the case of RBST. Once this
element is chosen, the remaining elements will belong
either to the interval [1-6] or [8-12], which are
subsequently completely disconnected from each other. Thus
choosing the first element is equivalent to breaking the
original interval [1-12] into two intervals, the left [1-6]
and the right [8-12], at the break point 7 which is chosen
randomly. Now consider the next element. It will either
belong to the left or the right interval. In the particular
example we are discussing, the next element 9 belongs
to the right interval [8-12]. This new element then
divides the right interval [8-12] again into two parts: the
left containing only the single element [8] and the right
[10-12]. These two new intervals subsequently become
completely independent of each other. The third element
[12] subsequently breaks the interval [10-12] into two
parts: the left part [10-11] and the right part which is
empty. Similarly the fourth element [5] breaks the
interval [1-6] into two parts: the left part [1-4] and the
right part [6], and so on.

Thus one can think of the construction of the RBST as
a dynamical fragmentation process where one starts with
a stick of initial length $N$ and breaks it randomly into two
parts: a left part of length $rN$ and a right part of length
$r'N$ with the constraint $r + r' = 1$, where $r$ is a random
number distributed uniformly over the interval [0, 1].
At the next step, one breaks each of these intervals again
into two parts. At any stage of breaking, the random
variable $r$ characterizing the break point of an interval
is chosen independently from interval to interval. They
are also independent from stage to stage. After $n$ steps
of breaking, there are $2^n$ intervals. Note that this
fragmentation process has itself a tree structure and can be
represented by a branching process as depicted in Fig. 3.



[Figure: fragmentation tree $T^*$ with root $N$, level-1 pieces $rN$ and $r'N$, and level-2 pieces $r_1 rN$, $r_2 rN$, $r_1' r'N$, $r_2' r'N$.]

FIG. 3. The fragmentation process has itself a tree structure
(denoted by $T^*$), shown here up to level 2. In the first
step an interval of length $N$ is broken into two pieces of lengths
$rN$ and $r'N$ such that $r + r' = 1$. Each of those pieces
is further broken into two parts satisfying the constraints
$r_1 + r_2 = 1$ and $r_1' + r_2' = 1$. At level $n$, there will be $2^n$
pieces.

A search tree of fixed size N is completed when in the 
corresponding fragmentation process, the lengths of all 
intervals are less than 1 because this means that all the 
elements of the incoming data string have already been 
incorporated onto the search tree. Although in the frag- 
mentation problem we have continuous intervals whereas 
in the RBST the intervals consist of discrete integers, it 
does not really matter since one can associate the inte- 
ger part of a break point to a particular integer element 
of the RBST. For example, if the first break point in 
the fragmentation problem is 7.3, this means that in the 
RBST problem, the first element (the root) is integer 7. 



Let us first consider the height $H_N$ of the RBST. By
definition, $H_N$ is the distance from the root (depth) of
the farthest element in the RBST. The RBST stops growing
beyond $H_N$ as all the incoming $N$ elements have been
incorporated in the tree. Thus when the RBST attains
the depth $H_N$, in the corresponding fragmentation problem,
the length of every interval is less than 1. Denote
by $l_1, \ldots, l_{2^n}$ the lengths of the $2^n$ intervals after $n$ steps of
breaking. Clearly, the probability $\mathrm{Prob}[H_N < n]$ in the
RBST problem is the same as the probability that all $2^n$
intervals in the fragmentation problem have lengths less
than 1,

$$ \mathrm{Prob}[H_N < n] = \mathrm{Prob}[l_1 < 1, \ldots, l_{2^n} < 1]. \tag{3} $$

The right hand side of Eq. (3) is also the probability that
the maximum of the lengths of the $2^n$ pieces is less than
1 in the fragmentation problem.

We next consider the balanced height $h_N$ of the RBST.
By definition, $h_N$ is the depth up to which the RBST
is fully saturated and balanced. Beyond this depth,
some parts of the RBST stop growing (see Fig. 1 where
$h_N = 3$). This means that in the corresponding random
fragmentation process, as long as the step number
of breaking is less than $h_N$, the lengths of all the intervals
must still be bigger than 1, so that each such interval
can incorporate a new element. Thus the probability
$\mathrm{Prob}[h_N > n]$ in the RBST is the same as the probability
that all $2^n$ intervals in the fragmentation problem have
lengths bigger than 1,

$$ \mathrm{Prob}[h_N > n] = \mathrm{Prob}[l_1 > 1, \ldots, l_{2^n} > 1]. \tag{4} $$

The right hand side of Eq. (4) is also the probability that
the minimum of the lengths of the $2^n$ pieces is bigger than
1 in the fragmentation problem.
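The two fragmentation probabilities on the right hand sides of Eqs. (3) and (4) are easy to estimate by direct simulation. The sketch below is a Monte Carlo illustration for the uniform (random) case only; the function names `fragment` and `estimate`, the seed, and the number of trials are arbitrary choices of ours, not part of the analysis.

```python
import random


def fragment(length, n_steps, rng):
    """Break a stick n_steps times. Every piece of length l splits at an
    independent uniform fraction r into r*l and (1 - r)*l.
    Returns the list of 2**n_steps piece lengths."""
    pieces = [float(length)]
    for _ in range(n_steps):
        nxt = []
        for l in pieces:
            r = rng.random()
            nxt.append(r * l)
            nxt.append((1.0 - r) * l)
        pieces = nxt
    return pieces


def estimate(length, n_steps, trials=2000, seed=0):
    """Monte Carlo estimates of Prob[max l_k < 1] and Prob[min l_k > 1]."""
    rng = random.Random(seed)
    all_small = 0
    all_large = 0
    for _ in range(trials):
        pieces = fragment(length, n_steps, rng)
        all_small += max(pieces) < 1.0    # event on the r.h.s. of Eq. (3)
        all_large += min(pieces) > 1.0    # event on the r.h.s. of Eq. (4)
    return all_small / trials, all_large / trials


for n in (4, 8, 12):
    print(n, estimate(100, n, trials=200))
```

As $n$ increases at fixed $N$, the first estimate grows from 0 toward 1 and the second falls from 1 toward 0, which is the behavior the traveling front analysis describes quantitatively.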

In the RBST, the new elements in the tree arrive randomly.
The corresponding fragmentation problem is also
random in the sense that at each stage an interval $l$ is broken
into two parts of lengths $rl$ and $r'l$ with $r + r' = 1$,
where the random variable $r$ is chosen each time independently
and is distributed uniformly over [0, 1]. One can,
of course, generalize this random fragmentation problem
to the case where the variable $r$ is chosen independently each time
but with an arbitrary distribution over [0, 1], not necessarily
uniform. This would correspond to a BST problem
where the new elements arrive with a specified distribution.
In general, at any stage of breaking, the joint probability
distribution of $r$ and $r'$ can be written as

$$ \mathrm{Prob}[r, r'] = \phi(r)\,\phi(r')\,\delta(r + r' - 1). \tag{5} $$

The delta function ensures that the total length is conserved
at every stage of breaking. The joint distribution
is written in a symmetric way to ensure that both $r$
and $r'$ have the same effective distribution, which is given
by $\eta(r) = \mathrm{Prob}(r) = \int_0^1 \mathrm{Prob}[r, r']\,dr' = \phi(r)\phi(1 - r)$.
The function $\phi(r)$ must be chosen such that the induced
distribution $\eta(r)$ satisfies the conditions $\int_0^1 \eta(r)\,dr = 1$
and $\int_0^1 r\,\eta(r)\,dr = 1/2$. The first condition
ensures normalizability of the single point distribution $\eta(r)$
and the second condition comes from the strict constraint
$r + r' = 1$, which implies $\langle r \rangle = \langle r' \rangle = 1/2$. In the case of
random breaking, the function $\phi(r) = 1$ and consequently
the induced distribution $\eta(r) = 1$ for $0 \le r \le 1$. A simple
example of a non-random break point distribution
is given by $\phi(r) = \sqrt{6}\,r$ with the induced distribution
$\eta(r) = 6r(1 - r)$ that satisfies the two constraints [33].
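For completeness, the two constraints can be verified explicitly for the non-random example $\phi(r) = \sqrt{6}\,r$; this short worked check uses only the definitions given above.

```latex
\eta(r) = \phi(r)\,\phi(1-r) = \sqrt{6}\,r \cdot \sqrt{6}\,(1-r) = 6r(1-r),
\qquad
\int_0^1 6r(1-r)\,dr = 6\left(\tfrac{1}{2} - \tfrac{1}{3}\right) = 1,
\qquad
\int_0^1 r \cdot 6r(1-r)\,dr = 6\left(\tfrac{1}{3} - \tfrac{1}{4}\right) = \tfrac{1}{2}.
```

Compared with the uniform case, this $\eta(r)$ vanishes at $r = 0$ and $r = 1$ and is peaked at $r = 1/2$, so break points near the middle of an interval are favored.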

Apart from its connection to the BST problem, the random
fragmentation problem is interesting in its own
right as it arises in various contexts such as the energy
cascades in turbulence [28], rupture processes in
earthquakes [29], financial crashes in stock markets [30], and
in the stress propagation in granular media [31]. In
our previous paper [33], we had studied the asymptotic
laws governing the probability distribution of the maximal
lengths of the intervals after $n$ steps of breaking in
the random fragmentation problem using traveling front
techniques. The same differential equation that describes
the Laplace transform of this distribution was also studied
independently by Drmota via a different method [37].
Both these methods work well for the random problem
(where $\eta(r) = 1$) but seem difficult to extend to the general
case when the break point in the fragmentation process
is chosen from an arbitrary induced distribution $\eta(r)$
[33]. It turns out, however, that the fragmentation problem
with general $\eta(r)$ can further be mapped to an MDP
problem as presented in the next section. This further
mapping followed by the traveling front analysis then
allows us to obtain exact asymptotic results for the general
case with arbitrary $\eta(r)$.

IV. MAPPING OF THE FRAGMENTATION 
PROBLEM TO A MODIFIED DIRECTED 
POLYMER PROBLEM 

In this section we further map the fragmentation prob- 
lem onto a MDP problem on a Cayley tree. This MDP 
problem turns out to be slightly different from the con- 
ventional DP problem studied in statistical mechanics 
due to the presence of a special constraint. Neverthe- 
less, asymptotic properties in the MDP problem can be 
derived analytically using the traveling front techniques. 

To understand this mapping, consider the set of $2^n$
intervals in the fragmentation problem after $n$ steps of
breaking, starting from the initial length $N$. Let $l_k$
denote the length of the $k$-th interval where $k = 1, \ldots, 2^n$.
From Fig. 3, it is clear that the length of any typical piece
$l_k$ can be expressed as the product

$$ l_k = N \prod_{i=1}^{n} r_i, \tag{6} $$

where the $r_i$'s are the set of independent random variables
encountered in getting the final piece of length $l_k$ after
$n$ steps of breaking the original interval of length $N$.
Note that in the tree $T^*$ in Fig. 3, there is a unique
path connecting the original interval (the root element
of $T^*$) to the $k$-th interval at stage $n$ and the set of random
variables $r_i$'s encountered in going from the root
of $T^*$ to the $k$-th piece at stage $n$ defines this unique
path. Alternately we can associate an energy variable
$\epsilon_i = -\log r_i$ to the bonds along this path and
the set of energies $\epsilon_i$'s also uniquely characterizes the path
(see Fig. 4). Taking the logarithm of Eq. (6), we see that the
total energy $E_k$ of a path (starting at the root and ending
at the $k$-th interval at stage $n$) becomes

$$ E_k = \log\left(\frac{N}{l_k}\right) = -\sum_{i=1}^{n} \log r_i = \sum_{i=1}^{n} \epsilon_i. \tag{7} $$

This path then represents a typical configuration of a
directed polymer (directed in the downward direction) with
energy given by Eq. (7) where the $\epsilon_i$'s are random bond
energies. Note that up to level $n$, there are a total number
of $2^n$ different paths having total energies
$E_1, \ldots, E_{2^n}$.

In the conventional DP problem, the bond energies $\epsilon_i$'s
are completely uncorrelated. To understand why they are
correlated in the present problem, recall that when an
interval is broken into two parts the random variables $r$ and
$r'$ characterizing the lengths of the two daughter intervals
satisfy the length conservation constraint, $r + r' = 1$.
Translated into the DP problem, the corresponding bond
energies $\epsilon = -\log r$ and $\epsilon' = -\log r'$ associated with the
two bonds emanating downwards from a given node must
satisfy the constraint

$$ e^{-\epsilon} + e^{-\epsilon'} = 1. \tag{8} $$

This constraint holds at every branching point of the tree
(see Fig. 4). This correlation makes the MDP problem
slightly different from the conventional DP problem.

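The joint distribution $\rho(\epsilon, \epsilon')$ of the energies of the two
bonds emanating from the common node and the induced
effective single bond distribution $\rho(\epsilon)$ are obtained from
Eq. (5) to give:

$$ \rho(\epsilon, \epsilon') = \phi(e^{-\epsilon})\,\phi(e^{-\epsilon'})\,e^{-\epsilon - \epsilon'}\,\delta\left(e^{-\epsilon} + e^{-\epsilon'} - 1\right), $$

$$ \rho(\epsilon) = \int_0^{\infty} \rho(\epsilon, \epsilon')\,d\epsilon' = \phi(e^{-\epsilon})\,\phi(1 - e^{-\epsilon})\,e^{-\epsilon}. \tag{9} $$

For example, for the RBST we have $\phi(r) = 1$, and
therefore $\rho(\epsilon) = e^{-\epsilon}$. Note that in the conventional
DP problem, the joint distribution $\rho(\epsilon, \epsilon')$ would simply
be the product of the single point distributions,
$\rho(\epsilon, \epsilon') = \rho(\epsilon)\rho(\epsilon')$, since they are independent. The MDP
problem, however, lacks this factorization property.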
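The lack of factorization can be seen in a small numerical experiment for the random case $\phi(r) = 1$. The sketch below is illustrative only and the helper name `sample_bond_energies` is hypothetical. It draws pairs $(\epsilon, \epsilon')$ from one branching event, checks the constraint of Eq. (8), and checks that each energy taken alone has mean 1, as expected for $\rho(\epsilon) = e^{-\epsilon}$. Since one of $r$, $r'$ is always at least 1/2, the two energies can never both exceed $\log 2$, whereas two independent exponential variables would do so with probability 1/4.

```python
import math
import random


def sample_bond_energies(rng):
    """One branching event with a uniform break point r.
    Returns the pair (eps, eps') = (-log r, -log(1 - r))."""
    r = rng.random()
    while r == 0.0:                   # avoid log(0); random() lies in [0, 1)
        r = rng.random()
    return -math.log(r), -math.log(1.0 - r)


rng = random.Random(0)
pairs = [sample_bond_energies(rng) for _ in range(100000)]

# Constraint of Eq. (8): exp(-eps) + exp(-eps') = 1 at every branching.
worst = max(abs(math.exp(-e) + math.exp(-ep) - 1.0) for e, ep in pairs)

# Single-bond statistics: the mean of an exp(-eps) density is 1.
mean_eps = sum(e for e, _ in pairs) / len(pairs)

# Correlation: count events where both energies exceed log 2.
both_large = sum(1 for e, ep in pairs if min(e, ep) > math.log(2.0))

print(worst, mean_eps, both_large)
```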
A. Statistics of the Height or the Minimum Energy 




FIG. 4. The MDP on a Cayley tree. This tree is isomorphic
to the tree of the fragmentation process shown in Fig. 3.
Each bond energy $\epsilon$ is related to the corresponding fraction $r$
via $\epsilon = -\log r$. The bond energies are correlated due to the
constraints: $e^{-\epsilon} + e^{-\epsilon'} = 1$, $e^{-\epsilon_1} + e^{-\epsilon_2} = 1$,
$e^{-\epsilon_1'} + e^{-\epsilon_2'} = 1$, etc.



Having set up the notation we turn to the variables
$H_N$ and $h_N$ in the original BST problem. What do
the distributions of $H_N$ and $h_N$ correspond to in the
MDP problem? First, consider the height distribution
$\mathrm{Prob}[H_N < n]$. From Eqs. (3) and (7), one finds

$$ \begin{aligned} \mathrm{Prob}[H_N < n] &= \mathrm{Prob}[l_1 < 1, \ldots, l_{2^n} < 1] \\ &= \mathrm{Prob}[E_1 > \log N, \ldots, E_{2^n} > \log N], \end{aligned} \tag{10} $$

where the $E_k$'s ($k = 1, \ldots, 2^n$) are respectively the total
energies of all the $2^n$ possible paths going from the root to
the leaves at the $n$-th level in the DP problem. The probability
in the last line in Eq. (10) is also the same as the
probability $\mathrm{Prob}[\min\{E_1, \ldots, E_{2^n}\} > \log N]$. Thus the
height distribution $\mathrm{Prob}[H_N < n]$ in the BST problem
is precisely related to the distribution of the minimum
(ground state) energy of the MDP problem, a quantity
of considerable interest in statistical physics.

Let us next consider the balanced height Hn- Using 
Eqs. (^) and (]?]), it follows similarly that, 



Prob [h N > n] = Prob [li > 1, . 
= Prob[£i < log N,..., E 2 n 



.,l 2 n > 1] 

< log TV], 



(11) 



which is also the probability that the maximum energy 
max{i?i, . . . , E 2 n } is less than log N. Thus the balanced 
height distribution Prob [hjf > n] in the BST problem is 
related to the distribution of the maximum energy in the 
MDP problem, a quantity that is usually not of much 
interest in statistical mechanics. 



In this subsection we analyze the asymptotic statistics of the height $H_N$ in the BST problem or, equivalently, that of the minimum energy in the MDP problem. Let $P_n(x) = \mathrm{Prob}[\min\{E_1, \ldots, E_{2^n}\} > x]$, where the $E_k$'s with $k = 1, \ldots, 2^n$ are the energies of the $2^n$ polymer paths from the root to the $n$-th level. It is then easy to write a recursion relation for $P_n(x)$,

$$P_{n+1}(x) = \int_0^\infty\!\!\int_0^\infty P_n(x - \epsilon)\,P_n(x - \epsilon')\,\rho(\epsilon, \epsilon')\,d\epsilon\,d\epsilon', \tag{12}$$

where $\rho(\epsilon, \epsilon')$ is the joint distribution of the two bond energies as given by Eq. (9). Equation (12) has been derived by analyzing different possibilities for the energies of the bonds emanating from the root and using the fact that the two subsequent daughter trees are statistically independent. Note that in the conventional DP problem, the corresponding recursion relation would be simplified using the factorization property of the joint distribution $\rho(\epsilon, \epsilon')$ and one would get [pO|J

$$P_{n+1}(x) = \left[\int_0^\infty P_n(x - \epsilon)\,\rho(\epsilon)\,d\epsilon\right]^2. \tag{13}$$



We have to solve the recursion relation (12) subject to the initial condition

$$P_0(x) = \begin{cases} 1, & x < 0, \\ 0, & x > 0, \end{cases} \tag{14}$$

and the boundary conditions

$$P_n(x) \to \begin{cases} 1, & x \to -\infty, \\ 0, & x \to \infty. \end{cases} \tag{15}$$
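To make the recursion (12) with the conditions (14)-(15) concrete, one may iterate it numerically. The sketch below treats only the RBST case, where $\phi(r) = 1$ and Eq. (12) reduces to $P_{n+1}(x) = \int_0^1 P_n(x + \log r)\,P_n(x + \log(1 - r))\,dr$. The grid spacing, the midpoint rule, and all function names are illustrative assumptions, not the method used in this paper.

```python
import math

X_MIN, X_MAX, DX = -1.0, 5.0, 0.02
N_GRID = int(round((X_MAX - X_MIN) / DX)) + 1
GRID = [X_MIN + i * DX for i in range(N_GRID)]
K = 200  # midpoint-rule nodes for the break point r in (0, 1)
LOG_PAIRS = [(math.log((j + 0.5) / K), math.log(1.0 - (j + 0.5) / K))
             for j in range(K)]

def p0(x):
    """Initial condition (14)."""
    return 1.0 if x < 0.0 else 0.0

def make_interp(values):
    """Linear interpolation of a profile stored on GRID, obeying (15) off-grid."""
    def p(x):
        if x <= X_MIN:
            return 1.0
        if x >= X_MAX:
            return 0.0
        t = (x - X_MIN) / DX
        i = min(int(t), N_GRID - 2)
        w = t - i
        return (1.0 - w) * values[i] + w * values[i + 1]
    return p

def step(p):
    """One application of recursion (12) for the RBST, with eps = -log r."""
    return [sum(p(x + la) * p(x + lb) for la, lb in LOG_PAIRS) / K for x in GRID]

def front_position(values):
    """Location where the profile crosses 1/2 (linear interpolation)."""
    for i in range(1, N_GRID):
        if values[i] < 0.5 <= values[i - 1]:
            w = (values[i - 1] - 0.5) / (values[i - 1] - values[i])
            return GRID[i - 1] + w * DX
    raise ValueError("front left the grid")

profiles = [step(p0)]                      # P_1
for _ in range(4):                         # P_2 ... P_5
    profiles.append(step(make_interp(profiles[-1])))
fronts = [front_position(v) for v in profiles]
print(fronts)
```

Because the minimum energy over $2^n$ paths cannot exceed $n \log 2$, the profiles $P_1, \ldots, P_5$ vanish well before the right edge of the grid, so truncating at `X_MAX` is harmless for this small number of iterations.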



The recursion relation (12) is nonlinear and in general difficult to solve exactly. However, its asymptotic properties can be derived analytically. As $n$ increases, the solution $P_n(x)$ of Eq. (12) looks like a $[1-0]$ front (i.e., $P_n(x) \approx 1$ for small $x$ but falls off rapidly to 0 for large $x$) advancing in the positive direction. This suggests that for large $n$, Eq. (12) admits a traveling front solution, $P_n(x) = F(x - x_n)$, where $x_n$ denotes the location of the front and the shape of the front is described by the fixed point scaling function $F$ that becomes independent of $n$. This implies that the width of the front is of order $O(1)$, i.e., it saturates in the large $n$ limit. The traveling front ansatz also indicates that the front advances with a uniform velocity, i.e., $x_n \approx vn$ to leading order for large $n$, where the velocity $v$ is yet to be determined. Substituting this traveling front ansatz, $P_n(x) = F(x - vn)$ for large $n$, in Eq. (12), we find that the fixed point function $F(y)$ satisfies the nonlinear integral equation

$$F(y - v) = \int_0^\infty\!\!\int_0^\infty F(y - \epsilon)\,F(y - \epsilon')\,\rho(\epsilon, \epsilon')\,d\epsilon\,d\epsilon', \tag{16}$$

where the velocity $v$ is still undetermined and $F(y)$ satisfies the boundary conditions

$$F(y) \to \begin{cases} 1 & \text{as } y \to -\infty, \\ 0 & \text{as } y \to \infty. \end{cases} \tag{17}$$



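Let us first analyze Eq. (16) in the tail region $y \to -\infty$. Plugging $F(y) = 1 - f(y)$ in Eq. (16) and neglecting terms of order $O(f^2)$ we find that $f(y)$ satisfies

$$f(y - v) = 2\int_0^\infty f(y - \epsilon)\,\rho(\epsilon)\,d\epsilon, \tag{18}$$

where we have used the relation $\rho(\epsilon) = \int_0^\infty \rho(\epsilon, \epsilon')\,d\epsilon'$. This linear equation (18) clearly admits an exponential solution $f(y) = \exp(\lambda y)$ provided the inverse decay rate $\lambda$ is related to the velocity $v$ via the dispersion relation

$$v(\lambda) = -\frac{1}{\lambda}\log\left[2\int_0^\infty e^{-\lambda\epsilon}\,\rho(\epsilon)\,d\epsilon\right]. \tag{19}$$

For a given induced distribution $\rho(\epsilon)$, the function $v(\lambda) \to -\log(2)/\lambda$ as $\lambda \to 0$ and $v(\lambda) \to 0$ as $\lambda \to \infty$, with a single maximum at a finite $\lambda^*$ determined via

$$\left.\frac{dv}{d\lambda}\right|_{\lambda = \lambda^*} = 0. \tag{20}$$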



Thus for all $\lambda$ such that $\int_0^\infty e^{-\lambda\epsilon}\rho(\epsilon)\,d\epsilon < 1/2$, the corresponding velocity $v(\lambda) > 0$. While any such $\lambda$ is in principle allowed, a particular velocity is actually asymptotically selected by the front. This velocity selection mechanism has been observed in a large class of nonlinear problems with a traveling front solution ]l2]- p^ , ^3| , ^9| -^l| . It is known that as long as the initial condition is sharp (as in the present case in Eq. (14)), the extreme value is chosen. From this general front selection principle, we infer that in our present problem, the front finally selects the velocity $v(\lambda^*)$ where $\lambda^*$ is given by the solution of Eq. (20). Thus the asymptotic front position, to leading order for large $n$, is given by

$$x_n \approx v(\lambda^*)\,n. \tag{21}$$



While the leading behavior of the front position $x_n$ is given exactly by Eq. (21), it turns out that it has an associated slow logarithmic correction. This logarithmic correction to the front velocity was first derived by Bramson in the context of a reaction-diffusion equation Jli) , and was subsequently found in many other systems with a traveling front |l7|JT^ , |3^ . |39| , [l0t . In Appendix A, we present a detailed derivation of this correction term following the approach of Brunet and Derrida [39]. The main result of this exercise is that the asymptotic front position for large $n$ is given by

$$x_n \approx v(\lambda^*)\,n + \frac{3}{2\lambda^*}\log n. \tag{22}$$

One can even calculate the next correction term by employing a more sophisticated approach [18], but we omit these results here. One important point to note is that while the velocity $v(\lambda^*)$ and $\lambda^*$ are nonuniversal, as they depend explicitly on the distribution $\rho(\epsilon)$, the prefactor 3/2 of the logarithmic correction in Eq. (22) is universal and is precisely the first excited state energy of a quantum harmonic oscillator (see Appendix A).

Let us now translate these results back to see what they mean for the height distribution in the original BST problem. From Eq. (10), it is clear that the cumulative height distribution for large $n$ is given by

$$\mathrm{Prob}[H_N < n] = P_n(\log N) \approx F(\log N - x_n), \tag{23}$$

where the front position $x_n$ is given by Eq. (22) and the function $F(y)$ is given by the solution of Eq. (16). Since the function $F(y)$ has the shape of a front with center at $y = 0$ and width of order $O(1)$, its derivative $F'(y)$ is a localized function around $y = 0$ with width of order $O(1)$. From Eq. (23) it then follows that the height distribution $\mathrm{Prob}[H_N = n]$ is also localized around its average value $\langle H_N \rangle$ with a variance $V(H_N) \sim O(1)$. Thus $H_N$ has a concentration of measure around its average value $\langle H_N \rangle$, which is given by the value of $n$ that corresponds to the zero of the argument of the function $F(y)$, i.e., when $x_n = \log N$. Using $x_n = \log N$ in Eq. (22) and solving for the required value of $n$ for large $N$, we obtain one of our main results

$$\langle H_N \rangle \approx \frac{1}{v(\lambda^*)}\log N - \frac{3}{2\lambda^* v(\lambda^*)}\log(\log N), \tag{24}$$

where $v(\lambda)$ and $\lambda^*$ are given respectively by Eqs. (19) and (20). This is the first result for the fragmentation problem with arbitrary break-point distribution going beyond the uniform case or, equivalently, for the BST problem where the elements in the tree arrive with an arbitrary distribution and not just randomly.
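The inversion leading from Eq. (22) to Eq. (24) can be spelled out as follows. This is only a sketch of the leading-order algebra; the $O(1)$ remainder is left unspecified.

```latex
\begin{align*}
\log N &= v(\lambda^*)\,n + \frac{3}{2\lambda^*}\log n
  && \text{[Eq. (22) with } x_n = \log N\text{]} \\
n &= \frac{\log N}{v(\lambda^*)} - \frac{3}{2\lambda^* v(\lambda^*)}\log n, \\
\log n &= \log(\log N) - \log v(\lambda^*) + o(1)
  && \text{[using } n \approx \log N / v(\lambda^*)\text{]} \\
\Rightarrow\quad n &\approx \frac{1}{v(\lambda^*)}\log N
  - \frac{3}{2\lambda^* v(\lambda^*)}\log(\log N) + O(1).
\end{align*}
```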

It is useful to exemplify the above general results. For the original RBST problem, $\phi(r) = 1$ or $\rho(\epsilon) = e^{-\epsilon}$ [see Eq. (9)]. Substituting $\rho(\epsilon) = e^{-\epsilon}$ into Eq. (19) we get

$$v(\lambda) = -\frac{1}{\lambda}\log\left[\frac{2}{\lambda + 1}\right], \tag{25}$$

which has a single maximum at $\lambda^* = 3.31107\ldots$ with $v(\lambda^*) = 0.23196\ldots$. Substituting $\lambda^*$ and $v(\lambda^*)$ into Eq. (24) we arrive at Eq. (1) with $\alpha_0 = 4.31107\ldots$ and $\alpha_1 = -1.95302\ldots$, in agreement with Refs p3| , ^5| .
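The numbers quoted for the RBST can be reproduced with a few lines of code. The sketch below maximizes the dispersion relation (25) by a golden-section search and then evaluates the two coefficients of Eq. (24); the search interval $[1, 10]$ and the helper names are our own illustrative choices.

```python
import math

def v_rbst(lam):
    """Dispersion relation (25) for the random binary search tree."""
    return -math.log(2.0 / (lam + 1.0)) / lam

def argmax_golden(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) > f(d):          # maximum lies in [a, d]
            b, d = d, c
            c = b - g * (b - a)
        else:                    # maximum lies in [c, b]
            a, c = c, d
            d = a + g * (b - a)
    return 0.5 * (a + b)

lam_star = argmax_golden(v_rbst, 1.0, 10.0)
v_star = v_rbst(lam_star)
alpha0 = 1.0 / v_star                       # coefficient of log N in Eq. (24)
alpha1 = -3.0 / (2.0 * lam_star * v_star)   # coefficient of log(log N) in Eq. (24)
print(lam_star, v_star, alpha0, alpha1)
```

The same routine applies to any other dispersion relation with a single maximum once `v_rbst` is replaced by the corresponding $v(\lambda)$.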

Consider another example, $\phi(r) = \sqrt{6}\,r$, a problem that could not be solved by the techniques used in our previous short paper p3[ . This corresponds to the fragmentation problem where the induced distribution of the break-point is $\eta(r) = 6r(1 - r)$. In the MDP problem, it corresponds to the induced bond energy distribution

$$\rho(\epsilon) = 6\,e^{-2\epsilon}\left(1 - e^{-\epsilon}\right). \tag{26}$$

Substituting this form in Eq. (19), we get

$$v(\lambda) = -\frac{1}{\lambda}\log\left[\frac{12}{(\lambda + 2)(\lambda + 3)}\right], \tag{27}$$

which has a unique maximum at $\lambda^* = 3.92408\ldots$, and this maximum velocity is given by $v(\lambda^*) = 0.31322\ldots$. Substituting these results into the general formula (24) we recover Eq. (1) with $\alpha_0 = 3.19258\ldots$ and $\alpha_1 = -1.22038\ldots$.
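For completeness, the Laplace transform that takes Eq. (26) into Eq. (27) is elementary; the short calculation below is included only as a check of the algebra.

```latex
\begin{align*}
\int_0^\infty e^{-\lambda\epsilon}\,\rho(\epsilon)\,d\epsilon
 &= 6\int_0^\infty \left[e^{-(\lambda+2)\epsilon} - e^{-(\lambda+3)\epsilon}\right] d\epsilon
  = 6\left[\frac{1}{\lambda+2} - \frac{1}{\lambda+3}\right]
  = \frac{6}{(\lambda+2)(\lambda+3)}, \\
v(\lambda) &= -\frac{1}{\lambda}\log\left[2\cdot\frac{6}{(\lambda+2)(\lambda+3)}\right]
  = -\frac{1}{\lambda}\log\left[\frac{12}{(\lambda+2)(\lambda+3)}\right].
\end{align*}
```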

The traveling front approach also gives the full probability distribution of the height variable in the BST and not just its exact average value as in Eq. (24). Indeed, we have seen that the cumulative height distribution is given by Eq. (23), where the function $F(y)$ is the solution of the boundary value problem (16)-(17). While we have not been able to solve the nonlinear integral equation (16) exactly, one can easily deduce the extreme behavior of $F(y)$. We have already seen that in the tail region $y \to -\infty$, the function $F(y)$ saturates to 1 exponentially fast, $1 - F(y) \sim \exp[\lambda^* y]$, where $\lambda^*$ is the solution of Eq. (20). One can also deduce the asymptotic behavior of $F(y)$ when $y \to \infty$ (Appendix B) for an arbitrary distribution $\rho(\epsilon)$. Thus the asymptotic behaviors of the function $F(y)$ read

$$F(y) \approx \begin{cases} 1 - A\,e^{\lambda^* y}, & y \to -\infty, \\ 2\int_y^\infty \rho(y' + v)\,dy', & y \to \infty, \end{cases} \tag{28}$$

where $A$ is a constant, $\lambda^*$ is found from (20), and $v = v(\lambda^*)$. In particular, for the RBST where $\rho(\epsilon) = e^{-\epsilon}$, $\lambda^* = 3.31107$, and $v(\lambda^*) = 0.23196$, one has

$$F(y) \approx \begin{cases} 1 - A\,e^{3.31107\,y}, & y \to -\infty, \\ 1.58596\,e^{-y}, & y \to \infty. \end{cases} \tag{29}$$
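The prefactor in the forward tail of Eq. (29) follows directly from the second line of Eq. (28); the one-line evaluation below is a check using the quoted value $v(\lambda^*) = 0.23196$.

```latex
\begin{align*}
2\int_y^\infty \rho(y' + v)\,dy' = 2\int_y^\infty e^{-(y' + v)}\,dy' = 2\,e^{-v}\,e^{-y},
\qquad 2\,e^{-0.23196} \approx 1.58596 .
\end{align*}
```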



In conclusion, the height distribution is a localized function around its average value $\langle H_N \rangle$ given by Eq. (24). For any unbounded distribution $\rho(\epsilon)$, the height distribution decays in the tail regions according to Eq. (28). For bounded distributions, however, $F(y)$ vanishes for sufficiently large $y$. Recall that the distribution of the minimum of a set of uncorrelated random variables is known to have a universal superexponential decay for large values |2^J. However, it was shown in Ref. [41] that in the conventional DP problem the distribution of the minimum energy of a polymer violates this Gumbel law due to hierarchical correlations between the energies of different paths. From Eq. (28) it is clear that in the MDP problem the forward tail is nonuniversal since it depends explicitly on the distribution $\rho(\epsilon)$. Generally, the forward tail is not superexponential, thus clearly violating the Gumbel statistics.



B. Statistics of the Balanced Height or the 
Maximum Energy 

The analysis of the statistics of the balanced height $h_N$ follows more or less the same approach as in the case of the height variable, except that one is now concerned with the distribution of the maximum energy in the MDP problem. Let $R_n(x) = \mathrm{Prob}[\max\{E_1, E_2, \ldots, E_{2^n}\} < x]$, where the $E_k$'s with $k = 1, 2, \ldots, 2^n$ are the energies of the $2^n$ polymer paths from the root to the $n$-th level. Then $R_n(x)$ satisfies the same recursion relation as $P_n(x)$ in Eq. (12),

$$R_{n+1}(x) = \int_0^\infty\!\!\int_0^\infty R_n(x - \epsilon)\,R_n(x - \epsilon')\,\rho(\epsilon, \epsilon')\,d\epsilon\,d\epsilon'. \tag{30}$$

The only difference is in the initial condition,

$$R_0(x) = \begin{cases} 0, & x < 0, \\ 1, & x > 0, \end{cases} \tag{31}$$

and in the boundary conditions

$$R_n(x) \to \begin{cases} 0, & x \to -\infty, \\ 1, & x \to \infty. \end{cases} \tag{32}$$



As in the case of Eq. (12), the recursion relation (30) admits a traveling front solution for large $n$, $R_n(x) = G(x - x_n^*)$, where $x_n^*$ is the front position and the fixed point scaling function $G(x)$ describes the shape of the front. Unlike the $[1-0]$ front in the previous subsection, the front for $R_n(x)$ has a $[0-1]$ form advancing in the positive direction. The front again advances with asymptotically constant velocity $v_1$, i.e., the position of the front is $x_n^* \approx v_1 n$. Substituting $R_n(x) = G(x - v_1 n)$ in Eq. (30), we find that $G(y)$ satisfies the nonlinear integral equation

$$G(y - v_1) = \int_0^\infty\!\!\int_0^\infty G(y - \epsilon)\,G(y - \epsilon')\,\rho(\epsilon, \epsilon')\,d\epsilon\,d\epsilon'. \tag{33}$$



The velocity $v_1$ is still undetermined and the front shape $G(y)$ satisfies the boundary conditions $G(y) \to 0$ as $y \to -\infty$ and $G(y) \to 1$ as $y \to \infty$. As in the previous subsection, we analyze Eq. (33) in the tail where $G(y) \to 1$; in the present case, this means $y \to \infty$. Substituting $G(y) = 1 - g(y)$ in Eq. (33) and neglecting terms of order $O(g^2)$ we get the linear equation

$$g(y - v_1) = 2\int_0^\infty g(y - \epsilon)\,\rho(\epsilon)\,d\epsilon. \tag{34}$$



Equation (34) admits an asymptotically exponential solution, $g(y) = \exp(-\mu y)$ as $y \to \infty$, with

$$v_1(\mu) = \frac{1}{\mu}\log\left[2\int_0^\infty e^{\mu\epsilon}\,\rho(\epsilon)\,d\epsilon\right]. \tag{35}$$

The dispersion relation in Eq. (35) has a single minimum at $\mu = \mu^*$ determined from the relation

$$\left.\frac{dv_1}{d\mu}\right|_{\mu = \mu^*} = 0. \tag{36}$$

By the general front selection mechanism, we infer that this minimum velocity will be selected by the front,

$$x_n^* \approx v_1(\mu^*)\,n. \tag{37}$$

The associated slow logarithmic correction can also be worked out following the same calculation as in Appendix A and we finally get

$$x_n^* \approx v_1(\mu^*)\,n - \frac{3}{2\mu^*}\log n. \tag{38}$$



Note that the correction term in Eq. (38) has a negative sign compared to the positive sign in Eq. (22).

In terms of the BST problem, it is clear from Eq. (11) that the cumulative balanced height distribution for large $n$ is given by

$$\mathrm{Prob}[h_N > n] = R_n(\log N) \approx G(\log N - x_n^*), \tag{39}$$

where the front position $x_n^*$ is given by Eq. (38) and the function $G(y)$ is the solution of Eq. (33). As argued in the previous subsection, the derivative $G'(y)$ is a localized function around $y = 0$ with width of order $O(1)$. Thus the balanced height distribution $\mathrm{Prob}[h_N = n]$ is also localized around its average value $\langle h_N \rangle$ with a variance $V(h_N) \sim O(1)$. The average value reads

$$\langle h_N \rangle \approx \frac{1}{v_1(\mu^*)}\log N + \frac{3}{2\mu^* v_1(\mu^*)}\log(\log N). \tag{40}$$



Consider again the same examples that were studied for the height variable in the previous subsection. For the RBST problem where $\phi(r) = 1$ or, equivalently, $\rho(\epsilon) = e^{-\epsilon}$, Eq. (35) becomes

$$v_1(\mu) = \frac{1}{\mu}\log\left[\frac{2}{1 - \mu}\right], \tag{41}$$

which has a single minimum at $\mu^* = 0.62663\ldots$ where $v_1(\mu^*) = 2.67834\ldots$. Thus, Eq. (40) reduces to Eq. (2) with $\alpha_0' = 0.37336\ldots$ and $\alpha_1' = 0.89374\ldots$.

For the second example, $\phi(r) = \sqrt{6}\,r$ or, equivalently, $\rho(\epsilon)$ given by Eq. (26), Eq. (35) becomes

$$v_1(\mu) = \frac{1}{\mu}\log\left[\frac{12}{(2 - \mu)(3 - \mu)}\right]. \tag{42}$$

The function $v_1(\mu)$ in Eq. (42) has a single minimum at $\mu^* = 1.17864\ldots$ where $v_1(\mu^*) = 1.76653\ldots$. Equation (40) again reduces to Eq. (2) with $\alpha_0' = 0.56607\ldots$ and $\alpha_1' = 0.72041\ldots$.

Finally we explain the duality between $H_N$ and $h_N$ in the BST problem. In the language of the MDP problem, these variables correspond to the minimum and maximum energy of a directed polymer in a random medium where the bond energies $\epsilon_i$ have nonzero support only for $\epsilon_i > 0$. Changing the sign of the bond energies maps the minimum energy in the negative support problem into the negative of the maximum energy in the positive support problem. This fact is reflected in the relation between the two dispersion relations in Eqs. (19) and (35), $v(-\lambda) = v_1(\lambda)$. Thus $\lambda^*$ and $-\mu^*$ are actually the two different roots of the same transcendental equation (20). Consequently, the constants $\alpha_0$ and $\alpha_0'$ in Eqs. (1)-(2) are merely two different roots of the same transcendental equation.
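The duality can be checked numerically for the RBST. Differentiating $v(\lambda) = \lambda^{-1}\log[(\lambda + 1)/2]$ shows that the stationarity condition (20) is equivalent to $\log[(\lambda + 1)/2] = \lambda/(\lambda + 1)$, which has one positive and one negative root. The sketch below locates both by bisection; the bracketing intervals and function names are illustrative assumptions.

```python
import math

def stationarity(lam):
    """Vanishes where dv/dlam = 0 for v(lam) = log((lam + 1) / 2) / lam."""
    return math.log((lam + 1.0) / 2.0) - lam / (lam + 1.0)

def bisect_root(f, lo, hi, tol=1e-12):
    """Plain bisection; assumes f changes sign exactly once on [lo, hi]."""
    flo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if (fmid > 0.0) == (flo > 0.0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def v(lam):
    """RBST dispersion relation, valid for lam > -1, lam != 0."""
    return math.log((lam + 1.0) / 2.0) / lam

lam_star = bisect_root(stationarity, 1.0, 10.0)      # positive root: height
mu_star = -bisect_root(stationarity, -0.99, -0.1)    # negative root: balanced height

alpha0 = 1.0 / v(lam_star)          # height coefficient
alpha0_prime = 1.0 / v(-mu_star)    # balanced height coefficient, v(-mu) = v_1(mu)
print(lam_star, mu_star, alpha0, alpha0_prime)
```

Both coefficients thus come out of a single function evaluated at its two stationary points, which is the content of the duality statement.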



V. GENERALIZATION TO M-ARY SEARCH 
TREES WITH ARBITRARY DISTRIBUTIONS 

The results obtained in the previous sections for the statistics of $H_N$ and $h_N$ of the BST's with arbitrary distributions can be generalized in a straightforward manner to the $m$-ary search trees. An $m$-ary search tree is constructed in the following way. One first collects the first $(m - 1)$ elements of the incoming data string and arranges them together in the root of the tree in an ordered sequence $x_1 < \ldots < x_{m-1}$. Next, when the $m$-th element $x_m$ comes, one compares it first with $x_1$. If $x_m < x_1$, the $m$-th element is assigned to the root of the leftmost daughter tree. If $x_1 < x_m < x_2$, then $x_m$ goes to form the root of the second branch, and so on. Each subsequent incoming element is assigned to one of the $m$ branches according to the above rule. Note that the level of the tree will increase beyond a given node only when the node gets filled beyond its capacity of $(m - 1)$ elements. Thus in the $m$-ary search tree, each node will contain at most $(m - 1)$ elements.
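The construction rule above can be sketched in a few lines of code. The class and function names below are hypothetical, keys are assumed distinct, and the height is counted here as the number of node levels, which may differ by a constant from the convention used for $H_N$ elsewhere in the paper.

```python
import bisect

class Node:
    """A node of an m-ary search tree: up to m - 1 sorted keys and m branches."""
    def __init__(self, m):
        self.keys = []
        self.children = [None] * m  # branch i holds keys between keys[i-1] and keys[i]

def insert(root, key, m):
    """Insert a key, descending until a node with spare capacity is found."""
    node = root
    while True:
        if len(node.keys) < m - 1:
            bisect.insort(node.keys, key)   # node not yet full: store the key here
            return
        i = bisect.bisect_left(node.keys, key)  # index of the branch to follow
        if node.children[i] is None:
            node.children[i] = Node(m)
        node = node.children[i]

def build(keys, m):
    """Build an m-ary search tree from a sequence of distinct keys."""
    root = Node(m)
    for key in keys:
        insert(root, key, m)
    return root

def height(node):
    """Number of node levels in the subtree rooted at node."""
    if node is None:
        return 0
    return 1 + max(height(child) for child in node.children)

tree = build([5, 2, 8, 1, 9, 3], m=3)
print(tree.keys, height(tree))
```

For $m = 2$ each node holds a single key and the routine reduces to ordinary binary search tree insertion.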

The mapping to the fragmentation problem goes through following the same line of arguments used for the binary tree in Sect. III. In this case, one starts with an interval of size $N$ and breaks it into $m$ pieces. Subsequently each piece is further broken into $m$ pieces and so on. When an interval is broken into $m$ pieces, each of the new pieces is a fraction of the original piece. The lengths of these $m$ new pieces are characterized by a set of $m$ random numbers $\{r_1, \ldots, r_m\}$ such that $\sum_{i=1}^m r_i = 1$, thus enforcing the length conservation. For each interval a new set of $r_i$'s is chosen from the same joint probability distribution

$$\mathrm{Prob}[r_1, \ldots, r_m] = \left[\prod_{i=1}^m \phi(r_i)\right]\delta\left(\sum_{i=1}^m r_i - 1\right). \tag{43}$$
As in the binary case, the distribution (43) is written in a symmetric form. Note that each new piece has the same effective induced distribution $\eta(r)$ given by the integral $\int_0^1 dr_2 \ldots \int_0^1 dr_m\, \mathrm{Prob}[r, r_2, \ldots, r_m]$, or

$$\eta(r) = \phi(r)\int_0^1\!\cdots\!\int_0^1 \left[\prod_{i=2}^m \phi(r_i)\,dr_i\right]\delta\left(r + \sum_{i=2}^m r_i - 1\right).$$

The function $\phi(r)$ must be chosen such that $\eta(r)$ satisfies the conditions $\int_0^1 \eta(r)\,dr = 1$ and $\int_0^1 r\,\eta(r)\,dr = 1/m$.

The random $m$-ary search tree corresponds to a random fragmentation problem where each of the fractions $r_1, \ldots, r_{m-1}$ is chosen from a uniform distribution between 0 and 1, setting $r_m = 1 - \sum_{i=1}^{m-1} r_i$, and then keeping only those sets where $r_m > 0$. This is precisely the so-called 'uniform' distribution used by Coppersmith et al. in the context of the $q$-model of force fluctuations in granular media [31]. In this case, $\phi(r)$ is a constant chosen in such a way that the joint distribution (43) is normalized. One finds [31]

$$\mathrm{Prob}[r_1, \ldots, r_m] = (m - 1)!\,\delta\left(\sum_{i=1}^m r_i - 1\right). \tag{44}$$

The corresponding effective single point distribution $\eta(r)$ reads

$$\eta(r) = (m - 1)(1 - r)^{m-2}. \tag{45}$$



Another interesting distribution is $\phi(r) \propto r$. In this case, the normalized joint distribution is given by (see Appendix C)

$$\mathrm{Prob}[r_1, \ldots, r_m] = (2m - 1)!\left[\prod_{i=1}^m r_i\right]\delta\left(\sum_{i=1}^m r_i - 1\right). \tag{46}$$

The corresponding effective distribution $\eta(r)$ can be deduced recursively (as shown in Appendix C) and we get

$$\eta(r) = (2m - 1)(2m - 2)\,r\,(1 - r)^{2m-3}. \tag{47}$$

Note that for $m = 2$, it reduces to $\eta(r) = 6r(1 - r)$, which was studied in detail for the binary case in section III.

The $m$-piece fragmentation problem for the special case of the uniform distribution (44) was studied in Ref. |}3| . However, as in the binary case, this method is not easy to extend to handle the general distribution $\eta(r)$ including, for example, the distribution (47). To go beyond the uniform case, we first map the fragmentation problem into the MDP problem as in the binary case. One proceeds exactly as in the binary case by associating an energy $\epsilon_i = -\log r_i$ to each bond of a directed polymer going from the root to the leaves of a Cayley tree, but now with $m$ daughters emerging from each node. The energies of the $m$ bonds emanating downwards from any given node are correlated due to the relation $\sum_{i=1}^m r_i = 1$, which translates into the constraint

$$\sum_{i=1}^m e^{-\epsilon_i} = 1. \tag{48}$$



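As in the binary case, this constraint holds at every branching point of the tree. The joint distribution $\rho(\epsilon_1, \ldots, \epsilon_m)$ is found from Eq. (43) to give

$$\rho(\epsilon_1, \ldots, \epsilon_m) = \left[\prod_{i=1}^m \phi(e^{-\epsilon_i})\,e^{-\epsilon_i}\right]\delta\left(\sum_{i=1}^m e^{-\epsilon_i} - 1\right). \tag{49}$$

Also the induced bond energy distribution $\rho(\epsilon)$ is related to the induced fraction distribution $\eta(r)$ via

$$\rho(\epsilon) = \int_0^\infty\!\cdots\!\int_0^\infty \rho(\epsilon, \epsilon_2, \ldots, \epsilon_m)\,d\epsilon_2 \ldots d\epsilon_m = \eta(e^{-\epsilon})\,e^{-\epsilon}. \tag{50}$$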

On this $m$-branch Cayley tree, there are a total of $m^n$ possible paths of the directed polymer going from the root to the leaves at the $n$-th level. Following arguments similar to the binary case, the cumulative height distribution in the $m$-ary search tree is related exactly to the distribution of the minimum energy of the $m^n$ polymer paths in the MDP problem via

$$\mathrm{Prob}[H_N < n] = \mathrm{Prob}[l_1 < 1, \ldots, l_{m^n} < 1] = \mathrm{Prob}[E_1 > \log N, \ldots, E_{m^n} > \log N], \tag{51}$$

where the $E_k$'s ($k = 1, 2, \ldots, m^n$) are the total energies of all $m^n$ possible paths. Similarly the cumulative distribution of the balanced height is related to the distribution of the maximum energy of the polymer paths via

$$\mathrm{Prob}[h_N > n] = \mathrm{Prob}[l_1 > 1, \ldots, l_{m^n} > 1] = \mathrm{Prob}[E_1 < \log N, \ldots, E_{m^n} < \log N]. \tag{52}$$



A. Statistics of the Height 

Let $P_n(x) = \mathrm{Prob}[\min\{E_1, \ldots, E_{m^n}\} > x]$. This distribution satisfies the recursion relation

$$P_n(x) = \int_0^\infty\!\cdots\!\int_0^\infty \rho(\epsilon_1, \ldots, \epsilon_m)\prod_{i=1}^m P_{n-1}(x - \epsilon_i)\,d\epsilon_i, \tag{53}$$

where the joint distribution $\rho(\epsilon_1, \ldots, \epsilon_m)$ is given by Eq. (49). The recursion starts with the same initial condition as in Eq. (14). The rest of the analysis is exactly the same as in the binary case. Substituting a traveling front solution, $P_n(x) = F(x - vn)$, in Eq. (53) and then linearizing near the tail $y \to -\infty$, we find as in the binary case $F(y) \sim 1 - e^{\lambda y}$, where the velocity $v$ of the front is related to $\lambda$ via the dispersion relation

$$v(\lambda) = -\frac{1}{\lambda}\log\left[m\int_0^\infty e^{-\lambda\epsilon}\,\rho(\epsilon)\,d\epsilon\right], \tag{54}$$

where the induced distribution $\rho(\epsilon)$ is given by Eq. (50).



The front velocity is then given by the maximum $v(\lambda^*)$ of the dispersion curve in Eq. (54) and is obtained by solving Eqs. (20) and (54). Similarly one can also work out the logarithmic correction to the front velocity, and the asymptotic front position is given by the same formula as in Eq. (22); only $\lambda^*$ and $v(\lambda^*)$ are different from the binary case. Similarly, the average height $\langle H_N \rangle$ for the $m$-ary search tree is also given by the same formula as in Eq. (24); the only change is in the dispersion curve $v(\lambda)$.

Let us now present some specific results. For the uniform distribution, $\rho(\epsilon) = (m - 1)\left[1 - e^{-\epsilon}\right]^{m-2} e^{-\epsilon}$, as follows from Eqs. (45) and (50). Substituting this into the dispersion relation (54) yields

$$v(\lambda) = -\frac{1}{\lambda}\log\left[m(m - 1)\,B(\lambda + 1, m - 1)\right], \tag{55}$$

where $B(m, n)$ is the Beta function. For instance, for $m = 3$ the velocity $v(\lambda)$ has a single maximum at $\lambda^* = 3.48985\ldots$ with $v(\lambda^*) = 0.40487\ldots$. Plugging these in the general formula (24) we again arrive at Eq. (1) with $\alpha_0 = 2.4698\ldots$ and $\alpha_1 = -1.0616\ldots$.
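The $m = 3$ values can be reproduced numerically from Eq. (55). The sketch below evaluates the Beta function through `math.lgamma` and maximizes $v(\lambda)$ by a golden-section search; the search interval and the helper names are illustrative assumptions.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def v_uniform(lam, m):
    """Dispersion relation (55) for the uniform m-ary fragmentation."""
    return -(math.log(m * (m - 1)) + log_beta(lam + 1.0, m - 1.0)) / lam

def argmax_golden(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return 0.5 * (a + b)

def height_constants(m, lo=0.5, hi=50.0):
    """Return (lambda*, v(lambda*), alpha_0, alpha_1) following Eq. (24)."""
    lam_star = argmax_golden(lambda lam: v_uniform(lam, m), lo, hi)
    v_star = v_uniform(lam_star, m)
    return lam_star, v_star, 1.0 / v_star, -3.0 / (2.0 * lam_star * v_star)

ternary = height_constants(3)
binary = height_constants(2)   # m = 2 reduces to the RBST result
print(ternary)
print(binary)
```

Setting $m = 2$ gives $B(\lambda + 1, 1) = 1/(\lambda + 1)$, so the same function recovers the binary dispersion relation (25).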

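Consider now the large $m$ limit. Using asymptotic properties of the Beta function, one gets

$$\lambda^* \sim \log m, \qquad v(\lambda^*) \approx \log(m/\lambda^*). \tag{56}$$

Therefore when $m \to \infty$, the average height is given by Eq. (1) with

$$\alpha_0 \approx \frac{1}{\log(m/\lambda^*)}, \qquad \alpha_1 \approx -\frac{3}{2\lambda^*\log(m/\lambda^*)}. \tag{57}$$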



Similarly for the distribution (47), Eqs. (50) and (54) lead to the following dispersion relation

$$v(\lambda) = -\frac{1}{\lambda}\log\left[m(2m - 1)(2m - 2)\,B(\lambda + 2, 2m - 2)\right].$$

For $m = 3$, we get the maximum at $\lambda^* = 4.17886\ldots$ with $v(\lambda^*) = 0.53235\ldots$. The average height is given by Eq. (1) with $\alpha_0 = 1.87845\ldots$ and $\alpha_1 = -0.67427\ldots$. The large $m$ behavior turns out to be exactly the same as in the case of the uniform distribution. One can work out the large $m$ asymptotics for an arbitrary distribution $\eta(r)$ (see Appendix D) and one gets the same asymptotics (56) as in the above examples. Therefore, the asymptotic behavior of $\langle H_N \rangle$ is universal (independent of the details of the distribution) in the large $m$ limit.



B. Statistics of the Balanced Height 

As in the binary case, we again utilize the distribution $R_n(x) = \mathrm{Prob}[\max\{E_1, E_2, \ldots, E_{m^n}\} < x]$. This distribution satisfies the recursion relation

$$R_n(x) = \int_0^\infty\!\cdots\!\int_0^\infty \rho(\epsilon_1, \ldots, \epsilon_m)\prod_{i=1}^m R_{n-1}(x - \epsilon_i)\,d\epsilon_i, \tag{58}$$

and the same initial and boundary conditions (31)-(32) as in the binary case. Plugging a traveling front solution $R_n(x) = G(x - v_1 n)$ into Eq. (58) and linearizing in the tail region $y \to \infty$ according to $G(y) \approx 1 - e^{-\mu y}$, we arrive at the dispersion relation

$$v_1(\mu) = \frac{1}{\mu}\log\left[m\int_0^\infty e^{\mu\epsilon}\,\rho(\epsilon)\,d\epsilon\right], \tag{59}$$

where the induced distribution is given by Eq. (50). The front velocity is then selected by the minimum $v_1(\mu^*)$ of this dispersion relation. Proceeding as in the binary case, the asymptotic front position is given by the same general formula as in Eq. (38); the only difference is that $\mu^*$ and $v_1(\mu^*)$ are different from the binary case. Finally, the average balanced height $\langle h_N \rangle$ for the $m$-ary search trees is also given by the same general formula as in Eq. (40), the only difference being the dispersion relation $v_1(\mu)$.

For the uniform distribution [Eq. (45)], we reduce Eq. (59) to

$$v_1(\mu) = \frac{1}{\mu}\log\left[m(m - 1)\,B(1 - \mu, m - 1)\right]. \tag{60}$$

Equation (60) can also be obtained from Eq. (55) by changing the sign of $\lambda = -\mu$, as expected. For example, for $m = 3$, the dispersion relation (60) has a unique minimum at $\mu^* = 0.68189\ldots$ where $v_1(\mu^*) = 3.90227\ldots$. Then the general formula (40) reduces to Eq. (2) with $\alpha_0' = 0.25626\ldots$ and $\alpha_1' = 0.56371\ldots$.

For the distribution (47), the dispersion relation reads

$$v_1(\mu) = \frac{1}{\mu}\log\left[m(2m - 1)(2m - 2)\,B(2 - \mu, 2m - 2)\right].$$

In the particular case of $m = 3$, one has $\mu^* = 1.28665\ldots$ and $v_1(\mu^*) = 2.62334\ldots$, so in this situation the average balanced height is given by Eq. (2) with $\alpha_0' = 0.38119\ldots$ and $\alpha_1' = 0.44440\ldots$.

One can also work out the large $m$ behavior for an arbitrary distribution $\eta(r)$ (Appendix D). Unlike the case of the height variable, the large $m$ behavior in the case of the balanced height is nonuniversal and depends explicitly on the small $r$ behavior of the distribution $\eta(r)$. If $\eta(r) \sim r^a$ as $r \to 0$, then (see Appendix D) $\mu^* \approx a + 1$ and $v_1(\mu^*) \approx \frac{a + 2}{a + 1}\log m$. Both these quantities, and hence the average balanced height, depend on the parameter $a$. Therefore, the balanced height remains nonuniversal in the large $m$ limit.
VI. CONCLUSIONS

In this paper we studied the statistics of height and balanced height in the BST problem by exploiting a two-stage mapping, "the BST problem $\to$ fragmentation problem $\to$ the MDP problem", and then using traveling front techniques to solve the MDP problem. While the first mapping has been used previously to obtain exact asymptotic results for the RBST problem, the second mapping allowed us to go beyond random trees and obtain exact asymptotic results for BST's where the new entries arrive in the tree according to an arbitrary distribution, not necessarily randomly. The fact that the traveling wave techniques, used previously in nonlinear physics, can be applied successfully to computer science problems is not just interesting: it also yields much new information, such as the shape of the full distribution of the height and not just its moments. It would be interesting to apply these techniques to more sophisticated search algorithms in computer science.
APPENDIX A: DERIVATION OF THE
LOGARITHMIC CORRECTION TO THE FRONT
POSITION

In this Appendix, we present a detailed derivation of the logarithmic correction to the asymptotic front position. We employ the approach of Ref. [39], where such a correction was computed for a reaction-diffusion equation. In the present context, our starting point is the recursion relation in Eq. (53) for the $m$-ary search trees. We first substitute $P_n(x) = 1 - f_n(x)$ in Eq. (53) and then neglect terms of order $O(f^2)$ in the regime $x \to -\infty$ to get a linear equation

$$f_{n+1}(x) = m\int_0^\infty f_n(x - \epsilon)\,\rho(\epsilon)\,d\epsilon, \tag{A1}$$

where $\rho(\epsilon)$ is the effective induced distribution given by Eq. (50). Next we assume that for large $n$ the front position is given by $x_n = vn + c(n)$, where both the velocity $v$ and the functional form of the correction term $c(n)$ are yet to be determined. Following Ref. [39], we then assume that for large $n$ the solution $f_n(x)$ of Eq. (A1) is given by the scaling form

$$f_n(x) = n^\gamma\,H\!\left(\frac{x - x_n}{n^\gamma}\right)e^{\lambda(x - x_n)}, \tag{A2}$$



where the exponent $\gamma$ and the scaling function $H(y)$ are not yet known. We only know that $H(y) \to 0$ as $y \to \pm\infty$ (since $0 < f_n(x) < 1$ for all $x$). Also, for large $n$ the prefactor $n^\gamma$ in Eq. (A2) must go away, indicating that $H(y) \sim y$ as $y \to 0$.

Let us define $z_n = (x - x_n)/n^\gamma$. Then to leading order for large $n$, one has $z_{n+1} \approx z_n - v/n^\gamma - \gamma z_n/n$. Substituting $z_{n+1}$ in Eq. (A2) and keeping only leading order terms we get for the left-hand side of Eq. (A1)

$$f_{n+1}(x) \approx n^\gamma e^{\lambda(x - x_n - v)}\left[H(z) - \frac{v}{n^\gamma}H'(z) - \frac{\gamma}{n}\,zH'(z) + \left(\frac{\gamma}{n} - \lambda\frac{dc}{dn}\right)H(z) + \frac{v^2}{2n^{2\gamma}}H''(z)\right]. \tag{A3}$$

In the above equation, we used the shorthand notations $z_n = z$, $H'(z) = dH/dz$, and $H''(z) = d^2H/dz^2$.

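Similarly, inserting Eq. (A2) into the right-hand side of Eq. (A1), expanding $H[(x - x_n - \epsilon)n^{-\gamma}]$ in a Taylor series in $\epsilon$ and keeping only leading order terms, we find for the right-hand side of Eq. (A1)

$$f_{n+1}(x) \approx m\,n^\gamma e^{\lambda(x - x_n)}\left[\rho_0 H(z) - \frac{\rho_1}{n^\gamma}H'(z) + \frac{\rho_2}{2n^{2\gamma}}H''(z)\right], \tag{A4}$$

where $\rho_k = \int_0^\infty \epsilon^k e^{-\lambda\epsilon}\rho(\epsilon)\,d\epsilon$. Comparing the left-hand side given by Eq. (A3) and the right-hand side given by Eq. (A4), we recover, to leading order for large $n$, the dispersion relation

$$e^{-\lambda v} = m\int_0^\infty e^{-\lambda\epsilon}\,\rho(\epsilon)\,d\epsilon. \tag{A5}$$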



As argued before, the front will choose the maximum velocity $v(\lambda^*)$ of the dispersion relation (A5). At $\lambda = \lambda^*$, $v'(\lambda^*) = 0$. Differentiating Eq. (A5) with respect to $\lambda$ we obtain $v(\lambda^*)\exp[-\lambda^* v(\lambda^*)] = m\rho_1$. Using this in Eq. (A3) shows that the term of order $n^{-\gamma}$ in Eq. (A3) cancels the corresponding term on the right-hand side in Eq. (A4). To ensure that the remaining terms are of the same order, we must have $\gamma = 1/2$ and $dc/dn = b/n$. The latter equation gives $c(n) = b\log n$, where $b$ is still undetermined. Employing these choices for $\gamma$ and $c(n)$ and equating Eqs. (A3) and (A4), we obtain

$$\left(v^2 - m\,e^{\lambda^* v}\rho_2\right)H''(z) - zH'(z) + (1 - 2b\lambda^*)H(z) = 0,$$

where $v = v(\lambda^*)$. This equation can be further simplified as follows. Differentiating Eq. (A5) twice with respect to $\lambda$ and using $v'(\lambda^*) = 0$ we get an additional relation, $v^2(\lambda^*) - m\rho_2\exp[\lambda^* v(\lambda^*)] = \lambda^* v''(\lambda^*)$. By inserting this into the above equation we finally arrive at the eigenvalue equation

$$-\lambda^* v''(\lambda^*)\,H''(z) + zH'(z) + (2b\lambda^* - 1)H(z) = 0. \tag{A6}$$
Note that $v(\lambda)$ has a maximum at $\lambda = \lambda^*$, indicating $v''(\lambda^*) < 0$. Rescaling $z = \sqrt{-\lambda^* v''(\lambda^*)}\,\zeta$, we find that the solution of Eq. (A6) that vanishes as $\zeta \to \infty$ is given by $H(\zeta) = B\,e^{-\zeta^2/4}D_{2b\lambda^* - 2}(\zeta)$, where $B$ is a constant and $D_p(\zeta)$ is the parabolic cylinder function of index $p$. The condition that $H(\zeta) \sim \zeta$ as $\zeta \to 0$ enforces the choice of the index $p = 2b\lambda^* - 2 = 1$, indicating $b = 3/(2\lambda^*)$. Note that the above solution describes precisely the wave function of the first excited state of a quantum harmonic oscillator and the factor 3/2 is the corresponding energy eigenvalue. Finally, the leading asymptotic behavior of the front position is given by

$$x_n \approx v(\lambda^*)\,n + \frac{3}{2\lambda^*}\log n. \tag{A7}$$
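The selected solution can be verified by direct substitution. With $b = 3/(2\lambda^*)$ the coefficient $2b\lambda^* - 1$ equals 2, and the rescaled form of Eq. (A6) reads $H'' + \zeta H' + 2H = 0$. The check below uses the standard closed form $D_1(\zeta) = \zeta e^{-\zeta^2/4}$ and is included only as a consistency test.

```latex
\begin{align*}
H(\zeta) &= B\,\zeta\,e^{-\zeta^2/2}
  && \text{[since } D_1(\zeta) = \zeta\,e^{-\zeta^2/4}\text{]} \\
H'(\zeta) &= B\,(1 - \zeta^2)\,e^{-\zeta^2/2}, \qquad
H''(\zeta) = B\,(\zeta^3 - 3\zeta)\,e^{-\zeta^2/2}, \\
H'' + \zeta H' + 2H &= B\left[(\zeta^3 - 3\zeta) + (\zeta - \zeta^3) + 2\zeta\right]e^{-\zeta^2/2} = 0 .
\end{align*}
```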



A similar calculation can be carried out for the bal- 
anced height where one finds a dispersion relation v\ (/i) 
as given by Eq. (EtJ) and front position is given by 



- — logn, 



(A8) 



where $\mu^*$ denotes the point where $v_1(\mu)$ has its unique
minimum.
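As an illustration of how the front parameters entering Eq. (A7) can be evaluated in practice, the following Python sketch locates the extrema of the two dispersion curves numerically for the RBST. It assumes the dispersion relations take the forms $e^{-\lambda v} = m \int_0^\infty e^{-\lambda\epsilon} p(\epsilon)\, d\epsilon$ and $e^{\mu v_1} = m \int_0^\infty e^{\mu\epsilon} p(\epsilon)\, d\epsilon$ with $m = 2$ and $p(\epsilon) = e^{-\epsilon}$, so that the right-hand sides reduce to $2/(1+\lambda)$ and $2/(1-\mu)$. The function names and the golden-section search are illustrative choices, not part of the original analysis.

```python
import math

GOLDEN = (math.sqrt(5.0) - 1.0) / 2.0  # inverse golden ratio, about 0.618


def v_height(lam):
    """Velocity v(lambda) for the RBST, assuming exp(-lam*v) = 2/(1+lam)."""
    return math.log((1.0 + lam) / 2.0) / lam


def v_balanced(mu):
    """Velocity v_1(mu) for the RBST, assuming exp(mu*v1) = 2/(1-mu), 0 < mu < 1."""
    return math.log(2.0 / (1.0 - mu)) / mu


def golden_min(f, lo, hi, tol=1e-10):
    """Golden-section search for the minimum of a unimodal f on [lo, hi]."""
    c = hi - GOLDEN * (hi - lo)
    d = lo + GOLDEN * (hi - lo)
    while hi - lo > tol:
        if f(c) < f(d):
            hi = d          # minimum lies in [lo, d]
        else:
            lo = c          # minimum lies in [c, hi]
        c = hi - GOLDEN * (hi - lo)
        d = lo + GOLDEN * (hi - lo)
    return 0.5 * (lo + hi)


# Height: maximum of v(lambda), found as the minimum of -v(lambda).
lam_star = golden_min(lambda x: -v_height(x), 0.5, 50.0)
v_star = v_height(lam_star)
log_coeff = 3.0 / (2.0 * lam_star)  # coefficient of log n in Eq. (A7)

# Balanced height: minimum of v_1(mu) on (0, 1).
mu_star = golden_min(v_balanced, 0.01, 0.99)
v1_star = v_balanced(mu_star)

print(f"lambda* = {lam_star:.5f}, v = {v_star:.5f}, 3/(2 lambda*) = {log_coeff:.5f}")
print(f"mu*     = {mu_star:.5f}, v_1 = {v1_star:.5f}")
```

Under these assumptions the stationarity conditions give $v(\lambda^*) = 1/(1+\lambda^*)$ and $v_1(\mu^*) = 1/(1-\mu^*)$, so the search should return $1/v(\lambda^*) \approx 4.311$ and $1/v_1(\mu^*) \approx 0.373$, the two roots of $\alpha \log(2e/\alpha) = 1$.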



APPENDIX B: ASYMPTOTIC BEHAVIOR OF 
THE CUMULATIVE HEIGHT DISTRIBUTION 

In this Appendix, we derive the large $y$ behavior of the
cumulative height distribution $F(y)$. The function $F(y)$
is the solution of the boundary value problem (16)-(17).
We already know that $1 - F(y) \sim e^{\lambda^* y}$ as $y \to -\infty$,
where $\lambda^*$ denotes the value of $\lambda$ where the dispersion
curve $v(\lambda)$ in Eq. (19) has its maximum. In order to derive
the asymptotic behavior of $F(y)$ in the other limit
$y \to \infty$, we first recast the integral equation (16) in a
slightly different form. Let us first define the cumulative
distribution function



$$
Y(\epsilon, \epsilon') = \int_0^{\epsilon}\! \int_0^{\epsilon'} p(x_1, x_2)\, dx_1\, dx_2, \tag{B1}
$$



where $p(x_1, x_2)$ is the joint energy distribution defined in the main text.
Writing $p(\epsilon, \epsilon') = \partial^2 Y / \partial\epsilon\, \partial\epsilon'$ on the right-hand side of
Eq. (16) and performing the integrations by parts (first
over $\epsilon$ and then over $\epsilon'$), we finally arrive at the following
equation

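$$
F(y - v) = \int_0^\infty\! \int_0^\infty F'(y - \epsilon)\, F'(y - \epsilon')\, Y(\epsilon, \epsilon')\, d\epsilon\, d\epsilon'
+ 2 \int_0^\infty F(y - \epsilon)\, p(\epsilon)\, d\epsilon - 1, \tag{B2}
$$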

where $F'(y) = dF/dy$ and we have used the boundary
conditions of $F(y)$. Note that due to the concentration
of measure, $F(y)$ has roughly the shape of a step function,
$F(y) \approx \theta(-y)$, with the front located at $y = 0$. Thus
the derivative roughly behaves as a negative delta function,
$F'(y) \approx -\delta(y)$. First reconsider the limit $y \to -\infty$.
In this limit, the arguments of the functions $F'$ inside
the integrands in the first term on the right-hand side
of Eq. (B2) are always very large and negative, indicating
that the contribution from this term is negligible as
$y \to -\infty$. Neglecting the first term, one finds that the
resulting linear equation admits the exponential solution
$1 - F(y) \sim e^{\lambda y}$, where $v$ depends on $\lambda$ through the
dispersion relation in Eq. (19). Thus one recovers the correct
result in the $y \to -\infty$ limit.

Turn now to the complementary limit $y \to \infty$. Then
the arguments of $F'$ inside the integrands of the first
term on the right-hand side of Eq. (B2) can be close to
zero, so this term picks up a substantial contribution. For large $y$,
one can approximate $F'(y) \approx -\delta(y)$ inside the integrands
on the right-hand side of Eq. (B2) and one then gets



$$
F(y - v) \approx Y(y, y) - 1 + 2 \int_0^\infty F(y - \epsilon)\, p(\epsilon)\, d\epsilon. \tag{B3}
$$



Note that $Y(y, y) \to 1$ and $F(y) \to 0$ as $y \to \infty$. To find the
asymptotics of $F(y)$ we differentiate Eq. (B3) with respect to $y$
and use $F'(y) \approx -\delta(y)$ in the second term. This gives



$$
F'(y - v) \approx -2 p(y) + 2\, \frac{\partial Y(y, y_2)}{\partial y_2} \bigg|_{y_2 = y}.
$$



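Using the definitions in Eq. (B1) and of the joint distribution $p(x_1, x_2)$, we find

$$
\frac{\partial Y}{\partial y_2} = -e^{-y_2}\, \phi(e^{-y_2})\, \phi(1 - e^{-y_2})\,
\theta\!\left( e^{-y_1} + e^{-y_2} - 1 \right). \tag{B4}
$$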



When $y_1 = y_2$ is large, the argument of the step function
in the above equation is always negative, so that
$\partial Y / \partial y_2$ vanishes and one can neglect the second term
on the right-hand side of the expression for $F'(y - v)$.
This gives $F'(y) \approx -2 p(y + v)$. Hence the
desired large $y$ behavior of $F(y)$ is given by



$$
F(y) \approx 2 \int_y^\infty p(y' + v)\, dy', \tag{B5}
$$



where $v = v(\lambda^*)$ is the maximum velocity associated with
the dispersion relation (19).

Note that the constraint $e^{-\epsilon} + e^{-\epsilon'} = 1$ does not modify
the form of the dispersion curve when compared to
the unconstrained conventional DP problem [the only
difference is that one has to first find the effective single
point energy distribution $p(\epsilon)$ in the constrained case
from Eq. (^)]. However, the above constraint does mod- 
ify the large y behavior of the cumulative distribution 
F(y). For example, Eq. (B4) is valid for the uncon- 
strained problem as well. However in the unconstrained 
case, $Y(y, y) = \left[ \int_0^y p(\epsilon)\, d\epsilon \right]^2$. In that case one finds,
after taking the derivative,
$F'(y) \approx -2 p(y + v) \int_{y+v}^\infty p(\epsilon)\, d\epsilon$,
indicating that for large $y$

$$
F(y) \big|_{\rm unconstrained} \approx 2 \int_y^\infty dy'\, p(y' + v) \int_{y'+v}^\infty p(\epsilon)\, d\epsilon. \tag{B6}
$$

For example, for the RBST where $p(\epsilon) = e^{-\epsilon}$, the large
$y$ asymptotics are $F(y) \sim e^{-y}$ (constrained case) and
$F(y) \sim e^{-2y}$ (unconstrained case).
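The two decay laws quoted for the RBST can be checked by direct quadrature. The sketch below reads Eq. (B5) as $2\int_y^\infty p(y'+v)\,dy'$ and Eq. (B6) as $2\int_y^\infty dy'\, p(y'+v)\int_{y'+v}^\infty p(\epsilon)\,d\epsilon$, which is an assumption about the integration limits. The value $v = 0.232$ is an assumed illustrative input (roughly the maximum of the RBST dispersion curve), and the helper names and the truncation of the infinite integrals are choices made here for illustration only.

```python
import math


def simpson(f, a, b, n=400):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def tail(p, x, cutoff=40.0):
    """Approximate int_x^infinity p(e) de by truncating at x + cutoff."""
    return simpson(p, x, x + cutoff)


def F_constrained(p, y, v):
    """Assumed large-y form (B5): 2 * int_y^inf p(y' + v) dy'."""
    return 2.0 * tail(lambda t: p(t + v), y)


def F_unconstrained(p, y, v):
    """Assumed large-y form (B6): 2 * int_y^inf dy' p(y'+v) int_{y'+v}^inf p(e) de."""
    return 2.0 * tail(lambda t: p(t + v) * tail(p, t + v), y)


def p_rbst(e):
    """Single-point energy distribution of the RBST, p(e) = exp(-e)."""
    return math.exp(-e)


v = 0.232  # assumed illustrative front velocity
for y in (2.0, 4.0, 6.0):
    fc = F_constrained(p_rbst, y, v)
    fu = F_unconstrained(p_rbst, y, v)
    print(f"y={y}: constrained={fc:.3e}, unconstrained={fu:.3e}")
```

For the exponential distribution both integrals are elementary, $2 e^{-(y+v)}$ and $e^{-2(y+v)}$ respectively, so the printed columns should fall off as $e^{-y}$ and $e^{-2y}$.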



APPENDIX C: DERIVATION OF THE 
INDUCED DISTRIBUTION 



In this Appendix we derive the induced distribution
$\eta(r)$ [see Eq. (47)] starting from the joint distribution



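$$
\mathrm{Prob}[r_1, \ldots, r_m] = A_m \left( \prod_{i=1}^{m} r_i \right)
\delta\!\left( \sum_{i=1}^{m} r_i - 1 \right). \tag{C1}
$$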



The constant $A_m$ in the above equation has to be chosen
such that the joint distribution is normalized. The
induced distribution $\eta(r)$ is obtained by fixing one of the
fractions, say the first one, to the value $r$ and then
integrating over all other fractions. Thus by definition



$$
\eta(r) = A_m\, r \int \cdots \int \prod_{i=2}^{m} r_i\, dr_i\;
\delta\!\left( r + \sum_{i=2}^{m} r_i - 1 \right). \tag{C2}
$$



Note that the $r_i$'s denote the lengths of $m$ intervals with
the total length equal to unity. Let us define a set
of new variables, $x_2 = r + r_2$, $x_3 = x_2 + r_3$, ...,
$x_{m-1} = x_{m-2} + r_{m-1}$. Here the $x_i$'s denote the points
separating adjacent intervals. Clearly then $x_{m-1} = 1 - r_m$
since the total length is unity. With this change of variables
the integral in Eq. (C2) becomes



$$
\eta(r) = A_m\, r\, \zeta_m(r), \tag{C3}
$$

where $\zeta_m(r)$ is given by

$$
\zeta_m(r) = \int_r^1 (x_2 - r)\, dx_2 \int_{x_2}^1 (x_3 - x_2)\, dx_3 \cdots
\int_{x_{m-2}}^1 (x_{m-1} - x_{m-2})(1 - x_{m-1})\, dx_{m-1}. \tag{C4}
$$

Thus $\zeta_m(r)$ satisfies the recursion relation

$$
\zeta_m(r) = \int_r^1 (x_2 - r)\, \zeta_{m-1}(x_2)\, dx_2. \tag{C5}
$$



One directly computes $\zeta_2(r) = 1 - r$ and $\zeta_3(r) =
(1 - r)^3/6$, which suggests seeking a solution in the form
$\zeta_m(r) = B_m (1 - r)^{2m-3}$. Plugging this expression
into the recursion (C5) yields



$$
B_m = \frac{B_{m-1}}{(2m-3)(2m-4)}, \tag{C6}
$$



which is iterated to give $B_m = 1/(2m-3)!$. Thus we
obtain $\eta(r) = A_m\, r (1 - r)^{2m-3}/(2m-3)!$. The normalization
condition $\int_0^1 \eta(r)\, dr = 1$ then gives $A_m = \Gamma(2m)$,
where $\Gamma(x)$ is the Gamma function. Therefore

$$
\eta(r) = (2m-1)(2m-2)\, r (1 - r)^{2m-3}, \tag{C7}
$$

which is valid for all $m \ge 2$.
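The recursion (C5) and the normalization leading to Eq. (C7) can be checked with exact rational arithmetic. The sketch below assumes the recursion has the form $\zeta_m(r) = \int_r^1 (x - r)\, \zeta_{m-1}(x)\, dx$ with $\zeta_2(r) = 1 - r$, and represents each $\zeta_m$ as a list of polynomial coefficients. All helper names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def poly_add(p, q):
    """Add two polynomials given as coefficient lists (index = power of r)."""
    n = max(len(p), len(q))
    return [(p[k] if k < len(p) else 0) + (q[k] if k < len(q) else 0)
            for k in range(n)]


def poly_scale(p, c):
    return [c * a for a in p]


def poly_times_r(p):
    return [Fraction(0)] + list(p)


def poly_antiderivative(p):
    """Antiderivative with zero constant term."""
    return [Fraction(0)] + [Fraction(a) / (k + 1) for k, a in enumerate(p)]


def poly_eval(p, x):
    return sum(a * x ** k for k, a in enumerate(p))


def next_zeta(zeta):
    """One step of the assumed recursion: int_r^1 (x - r) zeta(x) dx."""
    a0 = poly_antiderivative(zeta)                # antiderivative of zeta(x)
    a1 = poly_antiderivative(poly_times_r(zeta))  # antiderivative of x*zeta(x)
    int_x_zeta = poly_add([poly_eval(a1, 1)], poly_scale(a1, -1))
    int_zeta = poly_add([poly_eval(a0, 1)], poly_scale(a0, -1))
    return poly_add(int_x_zeta, poly_scale(poly_times_r(int_zeta), -1))


def zeta_closed(m):
    """Coefficients of (1 - r)^(2m-3) / (2m-3)!."""
    n = 2 * m - 3
    return [Fraction((-1) ** k * comb(n, k), factorial(n)) for k in range(n + 1)]


def check(m_max=8):
    zeta = [Fraction(1), Fraction(-1)]            # zeta_2(r) = 1 - r
    for m in range(2, m_max + 1):
        if m > 2:
            zeta = next_zeta(zeta)
        assert zeta == zeta_closed(m)
        # Normalization: A_m * int_0^1 r zeta_m(r) dr = 1.
        norm = poly_eval(poly_antiderivative(poly_times_r(zeta)), 1)
        assert 1 / norm == factorial(2 * m - 1)   # A_m = Gamma(2m) = (2m-1)!
    return True


print(check())
```

Exact fractions are used instead of floating point so that the comparison with $(1-r)^{2m-3}/(2m-3)!$ and with $A_m = (2m-1)!$ is an identity check rather than a tolerance test.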



APPENDIX D: LARGE M RESULTS FOR 
ARBITRARY DISTRIBUTION 

In this appendix we derive the large $m$ behavior of
$\langle H_N \rangle$ and $\langle h_N \rangle$ for $m$-ary search trees with arbitrary
distribution $\eta(r)$. We start with the height variable and
write the dispersion relation



$$
e^{-\lambda v} = m \int_0^\infty e^{-\lambda \epsilon}\, p(\epsilon)\, d\epsilon
= m \int_0^1 r^{\lambda}\, \eta(r)\, dr. \tag{D1}
$$



The constraint $\sum_i r_i = 1$ leads to $\int_0^1 r\, \eta(r)\, dr = 1/m$.
Thus for large $m$, a generic distribution $\eta(r)$ will have
to be concentrated near $r = 0$. Consider a class of distributions
which behave as $\eta(r) \approx C_m r^a e^{-b_m r}$ near
the origin. For example, $C_m = m - 1$, $a = 0$, and
$b_m = m - 2$ for the uniform distribution (45). Similarly,
$C_m = (2m-1)(2m-2)$, $a = 1$ and $b_m = 2m - 3$
for the distribution (47). These two examples suggest
that $C_m \sim m^{a+1}$ and $b_m \sim m$. Making use of the constraints
$\int_0^1 \eta(r)\, dr = 1$ and $\int_0^1 r\, \eta(r)\, dr = 1/m$ one indeed
confirms the above asymptotics: $b_m \approx (a+1) m$ and
$C_m \approx b_m^{a+1}/\Gamma(a+1)$.



We now consider the integral in Eq. (D1). Substituting
the small $r$ behavior of $\eta(r)$, performing the integral,
and using the Stirling formula one gets



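$$
e^{-\lambda v} \approx \frac{m \sqrt{2\pi(\lambda + a)}}{\Gamma(a+1)\, b_m^{\lambda}}
\left( \frac{\lambda + a}{e} \right)^{\lambda + a}. \tag{D2}
$$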



Taking the logarithm, differentiating with respect to $\lambda$,
and setting $v'(\lambda^*) = 0$ we determine $\lambda^*$ and $v(\lambda^*)$. The
leading contributions are given by Eq. (|5^). Therefore 
the large $m$ behavior of $\langle H_N \rangle$ is indeed universal.
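The quality of the small-$r$ approximation used for the integral in Eq. (D1) can be gauged numerically for the distribution of Eq. (C7), for which the integral is a Beta function. The sketch below compares the exact value of $m \int_0^1 r^{\lambda} \eta(r)\, dr$ with the value obtained from $\eta(r) \approx C_m r^a e^{-b_m r}$, taking $a = 1$, $b_m = (a+1)m$ and $C_m = b_m^{a+1}/\Gamma(a+1)$ and extending the upper limit to infinity. The choice $\lambda = 3$ and the function names are arbitrary illustrative assumptions.

```python
import math


def exact_rhs(lam, m):
    """m * int_0^1 r^lam eta(r) dr for eta(r) = (2m-1)(2m-2) r (1-r)^(2m-3)."""
    # The integral equals the Beta function B(lam + 2, 2m - 2).
    log_beta = (math.lgamma(lam + 2) + math.lgamma(2 * m - 2)
                - math.lgamma(lam + 2 * m))
    return m * (2 * m - 1) * (2 * m - 2) * math.exp(log_beta)


def small_r_rhs(lam, m, a=1.0):
    """Same quantity with eta(r) ~ C_m r^a exp(-b_m r), b_m = (a+1) m."""
    b = (a + 1.0) * m
    # C_m = b^(a+1) / Gamma(a+1), from the normalization of eta.
    log_c = (a + 1.0) * math.log(b) - math.lgamma(a + 1.0)
    # int_0^inf r^(lam+a) exp(-b r) dr = Gamma(lam+a+1) / b^(lam+a+1).
    log_int = math.lgamma(lam + a + 1.0) - (lam + a + 1.0) * math.log(b)
    return m * math.exp(log_c + log_int)


lam = 3.0
for m in (10, 100, 1000):
    ratio = exact_rhs(lam, m) / small_r_rhs(lam, m)
    print(f"m={m}: exact/approx = {ratio:.5f}")
```

For $\lambda = 3$ the ratio reduces to $4m^2/(4m^2 + 6m + 2)$, so it approaches unity as $1 - 3/(2m)$, consistent with the integral being dominated by the region near $r = 0$ at large $m$.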

We now turn to the large $m$ behavior of the average
balanced height $\langle h_N \rangle$. In this case, the appropriate
dispersion relation is given by Eq. (59):



$$
e^{\mu v_1} = m \int_0^\infty e^{\mu \epsilon}\, p(\epsilon)\, d\epsilon
= m \int_0^1 r^{-\mu}\, \eta(r)\, dr. \tag{D3}
$$



Substituting the small $r$ behavior, $\eta(r) \approx C_m r^a e^{-b_m r}$,
and performing the integral we obtain

$$
e^{\mu v_1} = \frac{m\, b_m^{\mu}}{\Gamma(a+1)}\, \Gamma(a + 1 - \mu). \tag{D4}
$$



We will see that in the large $m$ limit, $\mu^* \to a + 1$. Hence
we write $\mu = a + 1 - \delta$, assume that $\delta \ll 1$, plug these
into Eq. (D4) and take the logarithm to obtain



$$
(a + 1 - \delta)\, (v_1 - \log b_m) \approx \log\!\left[ \frac{m}{\Gamma(a+1)} \right] - \log \delta. \tag{D5}
$$



Differentiating Eq. (D5) with respect to $\delta$ and setting
$v_1'(\delta^*) = 0$ yields:



$$
\mu^* \simeq a + 1 - \frac{a+1}{\log m}, \qquad
v_1(\mu^*) \simeq \frac{a+2}{a+1}\, \log m + \ldots
$$


The parameter $a$ appears in the leading order even in the
large $m$ limit. Consequently, $\langle h_N \rangle$ also depends on $a$, and
hence the balanced height is not universal in the large
$m$ limit.
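The stationarity step behind this conclusion can be sketched explicitly. The lines below assume that Eq. (D5) reads $(a+1-\delta)(v_1 - \log b_m) \approx \log[m/\Gamma(a+1)] - \log\delta$ and that $b_m \approx (a+1)m$; the shorthand $K$ is introduced here only for convenience, and corrections of order $\log\log m$ are dropped in the last line.

```latex
\begin{align*}
v_1(\delta) &\approx \log b_m + \frac{K - \log\delta}{a+1-\delta},
  \qquad K \equiv \log\frac{m}{\Gamma(a+1)}, \\
\frac{dv_1}{d\delta}\bigg|_{\delta^*} = 0
  &\;\Longrightarrow\; K - \log\delta^* = \frac{a+1}{\delta^*} - 1
  \;\Longrightarrow\; v_1(\delta^*) = \log b_m + \frac{1}{\delta^*}, \\
\delta^* &\simeq \frac{a+1}{\log m}, \qquad
v_1(\mu^*) \simeq \log m + \frac{\log m}{a+1}
  = \frac{a+2}{a+1}\,\log m \quad (m \to \infty).
\end{align*}
```

The prefactor $(a+2)/(a+1)$ retains its dependence on $a$ as $m \to \infty$, which is the origin of the non-universality stated above.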